
AS 2021-22

Applied Statistics
This course focuses on providing students with basic statistical knowledge. It will equip them
with the essential statistical skills to identify and apply appropriate techniques to various
problems, as well as to make informed decisions. Upon completion of the course, students
should understand the basic statistical theories and be able to analyse, present and interpret
data using basic statistical methods.

Syllabus
1. Sampling Methods
2. Statistical Measures and Data Presentation
3. Probability
4. Probability distributions and expectation
5. Normal distribution
6. Sampling Distributions and Central Limit Theorem
7. Estimation
8. Hypothesis Testing
9. Analysis of Variance
10. Chi square test
11. Linear Regression

References
1. Bluman, Allan G. (2013), Elementary Statistics: A Step by Step Approach, 9th Edition, McGraw Hill.
2. Berenson, Mark L., Levine, David M. & Szabat, Kathryn A. (2015), Basic Business Statistics:
Concepts and Applications, 13th Edition, Pearson Education Limited.

Assessments
Individual assignments 60%
End of term assessment 40%

Intended Learning Outcomes and Grade Descriptors


A document with the set of intended learning outcomes and grade descriptors of this course
has been uploaded to the SOUL course link for your reference.

Class Lecturer
Name:

Contact email:

SOUL account: Course link: CCMA4009- ____________________

Class link: CCMA4009-____________________


General Reminders:
1. Remember to bring an HKEAA-approved calculator with SD (statistics) and REG (regression)
functions (no graphical display) to classes and assessments.
2. Check the SOUL course link and class link frequently for updated information about the
course and class management.
3. Correct your answers to 4 decimal places where necessary.


Introduction
What is Statistics?

Statistics is the science that processes and analyzes data in order to produce meaningful
information. Careful use of statistical methods will enable us to obtain accurate information
from data. These methods include (1) carefully defining the situation, (2) gathering data, (3)
accurately summarizing the data, and (4) deriving and communicating meaningful conclusions.

Statistics involves information, methods to summarize this information, and their interpretation.
The field of statistics can be roughly subdivided into two areas: descriptive statistics and
inferential statistics. Descriptive statistics focuses on data collection (e.g. methodology), data
presentation (e.g. charts and tables) and the description of sample data (e.g. average and standard
deviation). Inferential statistics estimates the population characteristics by means of the
sample statistics (e.g. the average) and uncovers other useful information about the population (e.g. the
relationship between income and expenditure).

Who uses statistics?

Read the following examples and you will see that statistics is used in almost every area:

(i) As a business student, you are required to review the sales of a new product, green tea ice-
cream. One way to review the sales is to refer to the number of boxes of green tea ice-
cream sold in a week in supermarkets. Furthermore, you need to compare the sales of
green tea ice-cream to the sales of chocolate ice-cream.

(ii) As a marketing student, you are asked to review the effectiveness of a series of promotions
in a shopping mall. One measurement is the comment (like / dislike) of teenagers.
Furthermore, you need to compare the responses made by male and female teenagers.

(iii) As an aviation student, you are working on the travel control team in a summer project.
You are asked to check the delay time of flights arriving at Hong Kong International Airport.
Furthermore, you are asked to compare the delay times of flights departing from different
countries.

It is much harder to name a field in which statistics is not used than it is to name one in which
statistics plays an integral part.


Chapter 1 Sampling Methods


In this lecture, you will learn about various types of data and different ways of selecting random
samples. Important terminologies in statistics will also be introduced.

Major learning objectives of this Chapter:


➢ Understand the difference between conducting a census and a sample survey
➢ Be able to collect a good representative sample by using the simple random sampling method,
systematic sampling method, stratified sampling method, and cluster sampling method
➢ Be able to classify variables as quantitative or qualitative

Gathering Data
There are many ways to gather data. One can collect primary data through observations,
experiments, or surveys. We can also collect secondary data by searching existing
information in publications or previous research. In this chapter, we will focus on how
to collect primary data by conducting a survey.

Census and Survey

When we talk about collecting data, we usually do not aim at collecting one piece of data.
Instead, we are talking about a large-scale data collection based on our research objective.
A questionnaire is usually designed with a standardized set of questions to be asked. Once the
questionnaire is designed, we need to define who or which is the most suitable person or unit to
respond to the questionnaire. The subject / item / element is the target information provider for the
specified research objective. Referring to the previous examples, we would define the subject
of each research as:

(i) Every supermarket which sells the green tea ice-cream


(ii) Every teenager
(iii) Every flight arriving at Hong Kong International Airport

If data is collected from every subject of the population, this is known as a census. When the
population is small, this could be a straightforward exercise. When the population size becomes
larger, taking a census can be very time-consuming. Also, in some situations, it is simply not
possible to survey every member. For example, how would you interview every teenager in Hong Kong?

When the data collection process covers less than 100% of the population, it is known as a sample
survey. Sample data can be obtained relatively cheaply and quickly, and if the sample is a good
representative of the population, a sample survey can give an accurate indication of the
population characteristic being studied.

Let’s have a simple comparison between conducting a census and a survey. Imagine there are 500
supermarkets selling the green tea ice-cream. You are required to collect information about the
number of boxes of green tea ice-cream sold in a supermarket during a week. If you do a census,
you need to visit all 500 supermarkets and keep a record of the number of boxes of green tea
ice-cream sold in each of the supermarkets, and by the end of the census you would prepare a
database in an Excel file like this:


Supermarket ID | Supermarket Address | Supermarket Phone Number | Number of boxes of green tea ice-cream sold | Number of boxes of chocolate ice-cream sold
1 | … | … | 82 | 102
2 | … | … | 75 | 69
3 | … | … | 65 | 87
4 | … | … | 125 | 136
5 | … | … | 87 | 92
… | … | … | … | …
499 | … | … | 102 | 185
500 | … | … | 65 | 63

After the census, data analysis would be conducted in order to answer the research objective.
Simple analysis such as the calculation of population mean and population standard deviation
would be the starting point for the analysis of the numerical dataset. (Do you know what it
means if the population mean number of boxes of green tea ice-cream sold is 75 and the
population standard deviation is 18?)

If you cannot do a census for any reason (time limitation, budget problem, …), then you may
end up doing a survey instead. For example, you do a survey with sample size n = 30. In this
case, you will only collect data from 30 supermarkets and keep a record of the collected data
like this:

Sample number | Supermarket ID | Supermarket Address | Supermarket Phone Number | Number of boxes of green tea ice-cream sold | Number of boxes of chocolate ice-cream sold
1 | 3 | … | … | 65 | 87
2 | 18 | … | … | 52 | 96
3 | 27 | … | … | 85 | 104
4 | … | … | … | … | …
5 | … | … | … | … | …
… | … | … | … | … | …
29 | … | … | … | … | …
30 | 499 | … | … | 102 | 185

After doing the survey, we would also want to analyse the collected data in order to draw
conclusions about the research objective. However, as the data collection is incomplete, we need
to be very careful when we draw those conclusions. The reliability of a conclusion from a
sample survey depends very much on how good a representative of the population the sample is.
Put simply, when doing a census, 100% of the data is collected. You can analyse the data to
explain the situation with no missing information. When a survey is conducted, you have some
information to study; however, there may be bias between the sample statistics and the
population parameters. So when we do a survey, try to:

1. Select a good representative sample to avoid bias based on subjective selection.

2. Understand the relationship between the sample statistics and the population parameters, and use
the sample statistics to estimate the unknown population parameters.

Sampling Methods

There are many sampling methods, which can be grouped into two categories: random and non-
random.

We shall consider a few types of each of these categories:

(I) Random sampling
Every subject should have a chance to be selected. An updated sampling frame is needed, which
consists of the contact information of every subject in the population.
(Ia) Simple Random Sampling
(Ib) Systematic Sampling
(Ic) Stratified Random Sampling
(Id) Cluster Sampling

(II) Non-random sampling
When practically no sampling frame is ready, the selection of subjects can only be conducted in a
convenient way. There is always a risk of bias due to the subjective selection of interviewees.
(IIa) Convenience Sampling
(IIb) Quota Sampling

(Ia) Simple Random Sampling


In simple random sampling, we use an unsystematic random selection process, i.e. we identify
every subject in the population and then choose subjects on some planned basis, ensuring that every
subject has the same opportunity of being selected. A simple lucky draw is a typical example of
simple random sampling.

(Ib) Systematic sampling


Systematic random sampling is done through some ordered criterion by choosing subjects from a
randomly arranged sampling frame. You choose “1 from every k” subjects in the population,
where k is the ratio between the population size and the sample size, i.e. k = N/n. For example,
a unique identity number is assigned to each of the 500 supermarkets, from 001, 002, …, 500. As 30
supermarkets will be selected, k = 500/30 ≈ 17. We select 1 supermarket from about every 17
supermarkets in the sampling frame.
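As an illustration of these two selection schemes, here is a minimal Python sketch. It assumes the
sampling frame is simply the list of supermarket identity numbers 1 to 500 and that a sample of size
30 is wanted; the variable names are illustrative only.

import random

N, n = 500, 30
frame = list(range(1, N + 1))        # sampling frame: supermarket IDs 001-500

# Simple random sampling: every subject has the same chance of being selected
srs_sample = sorted(random.sample(frame, n))

# Systematic sampling: take 1 from about every k subjects, starting at a random position
k = N // n                           # 500 // 30 = 16 (the notes round 500/30 to about 17)
start = random.randint(0, k - 1)
sys_sample = frame[start::k][:n]

print(srs_sample)
print(sys_sample)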

(Ic) Stratified random sampling


Sometimes, we group subjects into strata (subgroups with very different behaviour with respect to
the research objective), such as supermarkets in high-income, middle-income and low-income
districts. In order to obtain a sample that is fair to the different subgroups, we need to ensure
that the ratio between the different subgroups in the sample is the same as that in the population.
Random sampling in each subgroup is then done separately.

(Id) Cluster sampling


Sometimes, subjects are naturally divided into clusters so that each cluster is a small-scale
representation of the total population. Random sampling of clusters, instead of individual
subjects in the population, is conducted. All subjects in the selected clusters are involved in the
survey. An example is a study of household characteristics in Kowloon Bay. The subject of
the study should be the individual household in Kowloon Bay. In view of the very large
population, and because the preparation of a sampling frame of households is very challenging,
households are grouped into residential buildings. When the information on all buildings is
updated as the sampling frame, a random selection of buildings is conducted. All households in
the selected buildings then receive the questionnaire to provide household information for the survey.


(IIa) Convenience sampling


In order to draw random samples, there must be an updated sampling frame which keeps a record
of every subject in the population. In practice, there are many situations in which keeping such a
sampling frame is quite impossible. Without a sampling frame, you may still select subjects
based on their convenient accessibility. A street interview is a typical example of convenience
sampling. However, the disadvantage of convenience sampling is that it is completely non-random.
There is a real possibility of bias in the selection process, with the interviewer selecting
those easiest to question, perhaps those who look more co-operative.

(IIb) Quota sampling


In quota sampling, you select subjects on the basis of categories that are assumed to exist
within a population, and a fair proportion of representatives from the different sub-populations
should be reserved in the sample. How is quota sampling different from the stratified random
sampling discussed earlier on? With the stratified sampling method, subjects are randomly
selected from the stratified groups, while in quota sampling, subjects are selected non-randomly.

If no sampling frame exists, then quota sampling may be the only practical method of obtaining
a sample. This method is very widely used in marketing research. First the population is
subdivided into groups in terms of age, sex, income level, etc. Then the interviewer is told how
many people to interview within each specified group but is given no specific instructions
about how to locate them and fulfil the quota. It is quick to use, and complications are kept to a
minimum. However, just as with convenience sampling, bias may occur because of the interviewer's
subjective selection process.

In conclusion, random sampling is a fairer way to select a sample. Every subject in the population
should have a chance to be selected, so as to avoid bias due to subjective selection. In order to do
random sampling, an updated sampling frame must be prepared, and every subject in the
population is assigned a unique identity number for the selection purpose.


Types of Data

To obtain data we must observe or measure something. This something is known as a variable.
For example, in the introduction section we talked about collecting information about (i) the number
of boxes of green tea ice-cream sold in a supermarket during a week, (ii) the response to the
promotion, and (iii) the delay time of a flight.

There are two major types of data: quantitative (numerical) and qualitative (non-numerical).

Data
    Quantitative
        Discrete
        Continuous
    Qualitative

Referring to the previous examples,

(i) The number of boxes of ice-cream sold in a supermarket during a week is a quantitative variable.
Depending on whether the green tea ice-cream is a popular choice, the number of boxes of green
tea ice-cream sold in a supermarket during a week can be any positive integer; the more popular
the green tea ice-cream, the more boxes of ice-cream are sold during a week.

(ii) There are many ways to measure “response”. As the response to a series of promotion
activities is classified as “like” or “dislike”, this variable is a qualitative variable. Every
interviewee is simply asked to indicate how he or she feels about the promotion activities by putting
himself or herself in the “like” group or the “dislike” group. If the overall proportion of “like”
responses is high, then the promotion is a success, while a low proportion of “like” responses means
the promotion has failed to improve the image of the product.

(iii) The delay time of a flight is a quantitative variable. The delay time of a flight can be any
real number. Positive real number means a delay, while negative real number means the flight
arrives earlier than the expected time.

Considering the two examples of quantitative data, two scales of measurement can be further defined.
The weekly number of boxes of ice-cream sold can only take integer values; in this case, the
variable is considered a discrete random variable. The delay time of a flight is measured on
a continuous scale, and is considered a continuous random variable.


Referring to the three examples in the Introduction, below is the summary:

Research objective | Primary variable of interest | Type of variable | Subject of the study
Review the sales of a product | Number of boxes of ice-cream sold in a week | Quantitative (Discrete) | Supermarket selling the product
Review the effectiveness of the promotion in a shopping mall | Comment (Like / Dislike) | Qualitative | Teenager
Review the traffic (flights) in the airport | Delay time of a flight | Quantitative (Continuous) | Flight arriving at Hong Kong International Airport


Chapter 2 Statistical Measures and Data Presentation


The need to make sense of masses of information has led to formalized ways of describing the
tremendous and ever-growing amount of quantitative data being collected in almost all areas of
knowledge.

Given a raw set of data, there is often no apparent overall pattern. Perhaps some values are
more frequent, sometimes a few extreme values stand out, and usually the range of values is
noticeable. Presenting data involves such concepts as representative or average values, measure
of dispersion, and positions of various values, all of which fall under the broad topic of descriptive
statistics.

Major learning objectives of this Chapter:


➢ Be able to summarize a quantitative dataset by using: mean, mode, percentile, range, inter-
quartile range, variance, standard deviation, and skewness
➢ Be sensitive to the different formulae for the calculation of population variance and sample
variance
➢ Be able to work out the summary of a linear function of a variable

Given below is a sample of the number of boxes of green tea ice-cream sold in a supermarket
during a one-week period. The data is collected from a sample of 30 supermarkets. Below is the
ordered array of the data:

46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109

We are going to summarize the above data set in different dimensions.


2.1 MEASURE OF LOCATION

Mean
The mean of a data set is the average of all the data values.

Data collected from the whole population: μ = Σx / N
Data collected from a sample: x̄ = Σx / n

Hence in the above example

46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109
Sample mean x̄ = (46 + 49 + 50 + ⋯ + 109) / 30 = 72.0667 boxes

Besides, multiplying the mean by the size of the dataset gives back the total:
72.0667 × 30 ≈ 2162 boxes

Measurement unit of mean:


Reporting the measurement unit of the summary statistics helps readers interpret the
information. You can easily report the measurement unit of the mean by identifying the variable of
interest.

Remark:
Descriptive Statistics vs. Inferential Statistics
⚫ Descriptive statistics: the focus is to report the population mean / sample mean obtained in
the census / survey so as to describe the characteristics of the variable.
⚫ Inferential statistics: the sample mean obtained in the survey is used as an estimate
of the unknown population mean.

Mode

The mode of a data set is the value that occurs with greatest frequency.

Hence in the above example

46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109

Mode is 65 boxes (it appears 3 times).


Percentile

The pth percentile of a data set is a value such that at least p percent of the items take on this value
or less, and at least (100 − p) percent of the items take on this value or more.

Procedure to find the pth percentile


Step 1: Arrange the data in an ordered array. *
Step 2: Compute the index i as the number of data values in group 1, where i = n(p/100).
That means the number of data values in group 2 is n((100 − p)/100).
Step 3: Adjust i to the position of the pointer:
(a) If i is not an integer, round i up. The pth percentile is the value of the data in the ith position.
(b) If i is an integer, the pth percentile is the average of the values of the data in the ith and (i+1)th positions.

(Think about the two cases for handling the median when the number of data values is odd and when it is even!)

* Data must be arranged in an ordered array. Checking positions in a raw data set does not give
any information related to percentiles.

In the above example, the ordered array of the data is


46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109

Find the 10th percentile, 25th percentile, 50th percentile, 75th percentile, and 88th percentile.

Solution
10th percentile = (50 + 52) / 2 = 51 boxes    (i = (10/100)(30) = 3)
25th percentile = 59 boxes                    (i = (25/100)(30) = 7.5 ↑ 8)
50th percentile = (68 + 69) / 2 = 68.5 boxes  (i = (50/100)(30) = 15)
75th percentile = 85 boxes                    (i = (75/100)(30) = 22.5 ↑ 23)
88th percentile = 99 boxes                    (i = (88/100)(30) = 26.4 ↑ 27)

➢ The worst 10% of the supermarkets recorded the sales of less than 51 boxes of green tea ice-
cream, half of the supermarkets had the sales of less than 68.5 boxes and the top 12% of the
supermarkets had the sales of more than 99 boxes.

Special cases
⚫ 25th percentile = First Quartile Q1
⚫ 50th percentile = Second Quartile Q2 = median
⚫ 75th percentile = Third Quartile Q3
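A minimal Python sketch of this percentile rule, applied to the ordered sample of 30 supermarkets
above (the function name is only illustrative):

import math

data = [46, 49, 50, 52, 52, 56, 58, 59, 59, 60,
        62, 65, 65, 65, 68, 69, 70, 72, 75, 76,
        80, 82, 85, 88, 94, 96, 99, 99, 102, 109]   # already an ordered array

def percentile(ordered, p):
    # Step 2: index i = n * (p/100)
    i = len(ordered) * p / 100
    if i != int(i):
        # Step 3(a): i is not an integer -> round up, take the ith value
        return ordered[math.ceil(i) - 1]
    # Step 3(b): i is an integer -> average the ith and (i+1)th values
    i = int(i)
    return (ordered[i - 1] + ordered[i]) / 2

for p in (10, 25, 50, 75, 88):
    print(p, percentile(data, p))    # 51, 59, 68.5, 85, 99 boxes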


2.2 MEASURE OF DISPERSION

It is important not only to determine the location of the mean, but also to look at the variation within
the data. Surely you can tell the difference between two classes of students if (a) everyone
gets 76 marks in the examination, so the mean is 76 marks, and (b) students' performance varies
widely, from 18 to 98 marks, with a mean of 80 marks. In general, after reporting the
central location of the data, we will continue to report the variation among the data. There are
several ways to specify the variation in the data.

Range
⚫ It is the difference between the largest and smallest data values.
  Range = maximum value − minimum value
⚫ It is the simplest measure of variability.
⚫ It is very sensitive to the smallest and largest data values.

Hence in the above example

46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109

Range = 109 – 46 = 63 boxes

Interquartile range (IQR)


⚫ The interquartile range of a data set is the difference between the third quartile and the first quartile.
  IQR = Q3 − Q1
⚫ It is the range for the middle 50% of the data.
⚫ It overcomes the sensitivity to extreme data values.

Hence in the above example

46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109

Q1 = 59 boxes    (i = (25/100)(30) = 7.5 ↑ 8)
Q3 = 85 boxes    (i = (75/100)(30) = 22.5 ↑ 23)

IQR = Q3 – Q1 = 85 – 59 = 26 boxes


Variance

⚫ The variance is the average of the squared differences between each data value and the mean.

Data collected from the whole population: σ² = Σ(x − μ)² / N
Data collected from a sample: s² = Σ(x − x̄)² / (n − 1)

Remarks:
1. Think about the meaning of (x − μ)²: it is defined as a new variable which measures the
squared difference of each data point from the mean. The population variance is simply the
average of this new variable. When the variance is small, the difference of each data point
from the mean is small, which also means the data points are located closely together.

2. Why does the sample variance have a similar formula to the population variance but with the
denominator equal to n − 1? It is because this makes the sample variance a better estimator
of the population variance.

Standard deviation

⚫ The standard deviation of a data set is the positive square root of the variance.

  Standard deviation = √Variance

⚫ It is measured in the same units as the data, making it more easily comparable to the mean
than the variance is.
⚫ The population standard deviation is denoted as σ, while the sample standard deviation is denoted as s.
⚫ Practically, we calculate the standard deviation by using the calculator. (Refer to the appendix.)

Hence in the above example

46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109

Sample variance s² = [(46 − 72.0667)² + (49 − 72.0667)² + ⋯ + (109 − 72.0667)²] / (30 − 1) = 313.7885

Sample standard deviation s = √s² = 17.7140 boxes


Does it make sense if
Alternatively, we say most of the
sample standard deviation = 17.7140 boxes (from calculator) supermarkets sold
sample variance = 17.71402 = 313.7858 about 54 to 90 boxes of
green tea ice-cream?
➢ On the average, a supermarket sold 72 boxes of green tea
ice-cream in a week with standard deviation of 17.7 boxes.
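The same figures can be reproduced with Python's statistics module; a minimal sketch (the rounded
values match those above):

import statistics

data = [46, 49, 50, 52, 52, 56, 58, 59, 59, 60,
        62, 65, 65, 65, 68, 69, 70, 72, 75, 76,
        80, 82, 85, 88, 94, 96, 99, 99, 102, 109]

mean = statistics.mean(data)       # 72.0667 boxes
s2 = statistics.variance(data)     # sample variance, divisor n - 1
s = statistics.stdev(data)         # sample standard deviation, about 17.71 boxes
print(round(mean, 4), round(s2, 4), round(s, 4))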


2.3 SKEWNESS OF DATA

When the relative frequency of a variable at different data values is plotted, the probability density
function of the variable is visualized. A distribution can have many different shapes. We can
classify distributions according to their skewness. A distribution is symmetric if the parts above
and below its center are mirror images in the density function. A distribution is skewed to the
right if the right side is longer, while it is skewed to the left if the left side is longer. In this
course, we use the quartiles and median to describe the skewness of data.

Right skew: Q2 - Q1 < Q3 - Q2


(Example: monthly income of a fresh graduate)

Symmetry: Q2 - Q1 = Q3 - Q2
(Example: height of a 10 years old boy)

Left skew: Q2 - Q1 > Q3 - Q2


(Example: examination result)

In general, when

Q2 – Q1 = Q3 – Q2 symmetric distribution
Q2 – Q1 > Q3 – Q2 left-skewed distribution
Q2 – Q1 < Q3 – Q2 right-skewed distribution

Hence in the above example

46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109

Q1 = 59 boxes                      (i = (25/100)(30) = 7.5 ↑ 8)
Q2 = (68 + 69) / 2 = 68.5 boxes    (i = (50/100)(30) = 15)
Q3 = 85 boxes                      (i = (75/100)(30) = 22.5 ↑ 23)

The distribution is right-skewed as Q2 – Q1 = 9.5 < Q3 – Q2 = 16.5

In summary, the number of boxes of green tea ice-cream sold in a supermarket during one
week is a variable. According to the result collected from a sample of 30 supermarkets, the
mean was 72 boxes with standard deviation of 17.7 boxes. The worst 10% of the supermarkets
recorded the sales of less than 51 boxes, half of the supermarkets had the sales of less than 68.5
boxes and the top 12% of the supermarkets had sales of more than 99 boxes. The data was
right-skewed.
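A small sketch of the quartile comparison used here, assuming the quartiles already found for this sample:

q1, q2, q3 = 59, 68.5, 85           # quartiles of the green tea ice-cream sample

if q2 - q1 < q3 - q2:
    shape = "right-skewed"
elif q2 - q1 > q3 - q2:
    shape = "left-skewed"
else:
    shape = "symmetric"

print(shape)                         # right-skewed, since 9.5 < 16.5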

2.4 PRESENTATION OF DATA (Reference Reading)

Whenever you collect a set of data it is useful to plot the distribution. In statistics, there are
several ways that we usually employ.

Histogram
Histograms are an efficient and common way to describe distributions of continuous variables.
In general, histograms plot the frequency of occurrence of some observation within given fixed
width intervals.

Cumulative Frequency Distribution


Sometimes it is preferable to present data in a cumulative frequency curve, which shows how
many items are less than, or greater than various values.

Determining the percentage of women taller than 177 cm means integrating the frequency
distribution. The same number can be obtained from the cumulative frequency distribution by
simply setting a threshold value. Percentiles can also be read from the cumulative frequency
distribution directly; for example, the 88th percentile = 177 cm.


Stem and Leaf Plot


For example, there are 10 data values:
60, 51, 53, 42, 45, 42, 51, 65, 62, 50

The stem and leaf diagram to represent these numbers is:

Stem Leaf (unit: 1)


4 225
5 0113
6 025

Each digit to the left of the vertical line is a stem. The digits on the right of the vertical line are
the leaves associated with the stems. For the first row the stem is 4 and the leaves are 2, 2 and
5. This row represents 42, 42 and 45. This stem and leaf diagram has been created by splitting
each number into two parts in which the tens digit becomes the stem and the units digit the leaf.

Once the data have been ordered into stems and leaves, it is usual to order the leaves in ascending
order.

Box Diagram
It is used for the purpose of displaying five features of a set of data, namely the minimum, Q1,
median, Q3, and maximum, on a proper scale (horizontally or vertically). A box diagram depicts
the location of the center, the spread of the data and the distribution of the data. The following
box plot shows the examination marks of 196 students.

[Box plot of examination marks: maximum = 98, Q3 = 88, median = 80, Q1 = 71, minimum = 18]

From the box plot, you can get the following information:
➢ Half of the students score less than 80 marks and half of the students score more than 80
marks.
➢ The range of the scores is 80 marks (98 – 18) and the interquartile range is 17 marks (88
– 71).
➢ The distribution is slightly left skewed.

The advantage of the diagram is that it can summarize all five important features in one graph.
It is useful especially in comparison of several distributions. However, unlike stem and leaf
diagram, it does not show the detail of every single data.

2.5 SUMMARY STATISTICS OF A LINEAR FUNCTION

Sometimes, instead of just focusing on the analysis of the given variable, it is also of interest to
analyse a function of it. A simple linear function, which involves multiplication by a constant,
addition of a constant, or both, is often observed in daily applications.

Y = a + bX

Think about the following applications: how can we express Y in terms of X?

X | Y
Number of items sold in a month by a salesperson | Monthly salary, which is calculated with a basic salary of $20000 and an allowance of $30 for each item sold: Y = 20000 + 30X
Weight of a boy (in kg) | Weight of a boy (in pounds)
Original price | Discounted price with 10% off
Original monthly salary | Adjusted monthly salary with a 4% increment

Once the summary statistics for variable X have been calculated, the summary statistics of variable
Y can be calculated directly, without regenerating the dataset, using the following relationships:

Summary statistics Y = a + bX
Mean Mean(Y) = a + b Mean(X)
Percentile pth(Y) = a + b pth(X)
Range Range(Y) = |b| Range(X)
IQR IQR(Y) = |b| IQR(X)
Standard deviation SD(Y) = |b| SD(X)
Variance Variance(Y) = b2 Variance(X)

Below is an example with X as the variable which measures the number of items sold by a
salesperson in a month. The summary statistics are generated from a random sample of 15
salespersons. Without reviewing the raw dataset, the summary statistics for variable Y, the
monthly salary earned by a salesperson, can be easily generated as follows:

Summary statistics | X | Y = 20000 + 30X
Mean | 648.73 items | 20000 + 30(648.73) = $39461.90
10th Percentile | 345 items | 20000 + 30(345) = $30350
Median | 668 items | 20000 + 30(668) = $40040
90th Percentile | 904 items | 20000 + 30(904) = $47120
Range | 690 items | 690(30) = $20700
Standard deviation | 229.54 items | 229.54(30) = $6886.20
Variance | 52688.61 items² | 52688.61(30²) = 47419749.00 $²
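The shortcut can be checked with a sketch like the one below. The raw dataset of 15 salespersons is
not given in the notes, so the list here is made up purely for illustration; the point is that applying
the rules to the summary of X agrees with recomputing the summary directly from Y = 20000 + 30X.

import statistics

# hypothetical monthly item counts for 15 salespersons (illustrative only)
x = [512, 604, 345, 780, 668, 590, 904, 702, 455, 630, 720, 810, 540, 660, 811]
a, b = 20000, 30                     # Y = 20000 + 30X

y = [a + b * xi for xi in x]         # transform every data point

# rules applied to the summary of X ...
mean_rule = a + b * statistics.mean(x)
sd_rule = abs(b) * statistics.stdev(x)

# ... agree with the summary computed directly from Y
print(round(mean_rule, 2), round(statistics.mean(y), 2))
print(round(sd_rule, 2), round(statistics.stdev(y), 2))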

APPENDIX (Reference Reading)

When calculating a statistical parameter of a data set, it is often necessary to use an intermediary
result (e.g. the mean) during the computation. By including such an estimator in the calculation,
the number of independent scores is reduced, or we say that the degree of freedom is reduced by
one.

Consider the calculation of the sample variance, which is computed by averaging the
squares of the deviations from the mean value. As the population mean is unknown, it is
estimated by the sample mean:

s² = Σ(x − x̄)² / (n − 1)

Since the average x̄ is computed from all scores x, the number of independent scores in the formula
above is reduced from n to n − 1, because you could calculate one particular score by using the
mean and the other (n − 1) scores.

Generally speaking, the degrees of freedom (df) depend on the number of independent
observations: the number n of observations less the number a of estimated parameters:

df = n − a


Calculator Usage on Descriptive Statistics


(For Casio fx-50FH / fx-50FH II)

Data Set:
163.6 156.2 166.3 179.3 157.8 165.4 159.5 161.7 160.4

1. Change to “SD” mode


MODE MODE SD

2. Clear previous data


SHIFT CLR Stat EXE

3. Input data
163.6 DT 156.2 DT 166.3 DT 179.3 DT
157.8 DT 165.4 DT 159.5 DT 161.7 DT
160.4 DT

4. Calculate descriptive statistics


Mean (𝑥̅ =163.3555556) : SHIFT 2 1 EXE
Population standard deviation ( 𝑥 𝜎𝑛 = 6.459637417) : SHIFT 2 2 EXE
Sample standard deviation ( 𝑥 𝜎𝑛−1= 6.851480132) : SHIFT 2 3 EXE
No. of data input (n= 9) : SHIFT 1 3 EXE

5. Change Data
Example: change the first data ‘163.6’ to ‘183.6’
▲/▼ (until you see x1=163.6) 183.6 EXE

6. Delete Data
Example: delete the second data ‘156.2’
▲/▼ (until you see x2=156.2) SHIFT DT

7. Frequency (more than 1 observation)


Example: 5, 5, 5, 5
5 SHIFT , 4 DT

8. Return to normal mode


MODE 1


Commonly used notation and useful formulae


 | Population | Sample
Size | N | n

Measures of Location
Mean | μ = Σx / N | x̄ = Σx / n
Mode | The value in a set of data that appears most frequently
pth percentile | A value such that at least p% of the observations take on this value or less
 | • compute index i = n(p%)
 | • if i is an integer, the pth percentile is the average of the values in positions i and i + 1
 | • if i is not an integer, round up the index i; the pth percentile is the value in position i
Median | Median = 50th percentile
Quartiles | First quartile: Q1 = 25th percentile
 | Second quartile: Q2 = 50th percentile
 | Third quartile: Q3 = 75th percentile

Measures of Dispersion
Range | Range = Xlargest − Xsmallest
Interquartile range | IQR = Q3 − Q1
Variance | σ² = Σ(x − μ)² / N | s² = Σ(x − x̄)² / (n − 1)
Standard deviation | σ = √(Σ(x − μ)² / N) | s = √(Σ(x − x̄)² / (n − 1))

Skewness
Describe the shape of the distribution as:
 | If Q2 − Q1 > Q3 − Q2: left-skewed
 | If Q2 − Q1 = Q3 − Q2: symmetric
 | If Q2 − Q1 < Q3 − Q2: right-skewed


Chapter 3 Probability
Probability is the likelihood or chance of “something” happening. For example, we may
want to know how likely it is that a customer will spend $200 or more in one visit to the supermarket.
An event A, called ‘spending $200 or more’, can be defined. With sufficient information, we may
be able to evaluate the probability that a customer will spend $200 or more in one visit to the
supermarket, denoted as P(A) = 0.8.

As you should have learnt some probability theory in your previous studies, in this chapter we
just review some of the important concepts.

Major learning objectives of this Chapter:


➢ Be able to compile empirical probability based on the collected information
➢ Be able to compile the conditional probability

Sample Space and Event


When we talk about a topic related to probability, it must be a situation with an uncertain outcome.
Typical examples involve discussing the result of tossing a die, tossing a coin, or, as here, the
amount of money a customer will spend in one visit to the supermarket.

The sample space is defined as the set of all possible outcomes, usually denoted by S.

An event is a subset of the sample space, which is also the probability statement you want to
evaluate.

Example 1
What is the probability of getting the number “1” when a fair die is tossed?
S = {1, 2, 3, 4, 5, 6}
A = {1}

Example 2
What is the probability that a customer will spend $200 or more in one visit to the supermarket?
S = {x ≥ 0}, where x is the spending in one visit to the supermarket
A = {x ≥ 200}


Compiling Probability

There are two different approaches to compile probabilities: classical probability and empirical
probability.

Classical Probability:
Assuming each sample point has the same opportunity of happening, the probability of an event
is:
P(A) = n(A) / n(S)

where n(A) is the number of sample points in event A.

Example 1
What is the probability of getting the number “1” when a fair die is tossed?
S = {1, 2, 3, 4, 5, 6}
A = {1}
P(A) = 1/6

Empirical Probability:
We need to observe the relative frequency in an actual experiment and use the relative frequency of
the event as the probability. This type of probability can be used when we study the results of
a survey or records collected in the past.

Example 2
What is the probability that a customer will spend $200 or more in one visit to the supermarket?
S = {x ≥ 0}, where x is the spending in one visit to the supermarket
A = {x ≥ 200}

Referring to the following result from a survey:


Frequency
Spent < $200 120
Spent $200 or more 480
Total 600

P(A) = 480/600 = 0.8, i.e. the probability that a customer will spend $200 or more in one visit to the
supermarket is 0.8.


Important Rules of Probability

There are some important rules:


 0  P ( A)  1
 P ( S ) = 1; P (  ) = 0
 P ( A) + P ( A ) = 1 (complementary rule)

Example 2
As the probability that a customer will spend $200 or more in one visit to the supermarket is 0.8,
it also implies the probability that a customer will spend less than $200 in one visit to the
supermarket is 0.2. Comparatively it is more likely a customer will spend $200 or more than
less than $200 in one visit to the supermarket.

Conditional Probability
When an extra requirement is specified before the probability is compiled, we say that a conditional
probability is to be calculated. For example, the calculation of the conditional probabilities below
helps us to compare the behavior of two groups of customers.

Example 2
(i) What is the probability that a male customer will spend $200 or more in one visit to the
supermarket?
(ii) What is the probability that a female customer will spend $200 or more in one visit to the
supermarket?

In order to answer these two questions, we need to reorganize the previous information in a
contingency table:

Male Female Frequency


Spent <$200 75 45 120
Spent $200 or more 175 305 480
Total 250 350 600

(i) P(A | male) = 175/250 = 0.7
(ii) P(A | female) = 305/350 = 0.8714

➢ Comparatively, the chance for a female customer to spend $200 or more in one visit to
the supermarket is relatively higher than that for a male customer.

(Do you know how to interpret the notations P(A | male) and P(male | A)?)
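A minimal sketch of these empirical calculations, using the counts from the contingency table above:

# contingency table: spending in one visit vs. gender
spent_200_or_more = {"male": 175, "female": 305}
spent_less = {"male": 75, "female": 45}

total = sum(spent_200_or_more.values()) + sum(spent_less.values())   # 600 customers
p_a = sum(spent_200_or_more.values()) / total                        # P(A) = 0.8

male_total = spent_200_or_more["male"] + spent_less["male"]          # 250
female_total = spent_200_or_more["female"] + spent_less["female"]    # 350

p_a_given_male = spent_200_or_more["male"] / male_total              # 0.7
p_a_given_female = spent_200_or_more["female"] / female_total        # 0.8714...

print(p_a, p_a_given_male, round(p_a_given_female, 4))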


Independent Events

Just now we evaluated that the chance for a female customer to spend $200 or more in one visit to
the supermarket is higher than that for a male customer.

As knowing the gender of a customer would change the chance for that customer to spend $200
or more, in a statistical sense we say that “the spending in one visit to the supermarket” and “the
gender” are two dependent events.

Only when knowing that one event has happened does not change the chance of the other event
happening are the two events called independent events.

Compare the following situations:


(I) Are the spending in the supermarket and the gender independent variables?
Male Female Frequency
Spent < $200 75 45 120
Spent $200 or more 175 305 480
Total 250 350 600

(i) P(spent $200 or more) = 480/600 = 0.8
(ii) P(spent $200 or more | male) = 175/250 = 0.7
(iii) P(spent $200 or more | female) = 305/350 = 0.8714
➢ “The spending in one visit to the supermarket” and “the gender” are dependent variables.

(II) Are the spending on movie watching and the gender independent variables?
Male Female Frequency
Spent < $80 55 70 125
Spent $80 or more 165 210 375
Total 220 280 500

(i) P(spent $80 or more) = 375/500 = 0.75
(ii) P(spent $80 or more | male) = 165/220 = 0.75
(iii) P(spent $80 or more | female) = 210/280 = 0.75
➢ “The spending in one visit to the cinema” and “the gender” are independent variables.
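The same comparison can be scripted. This sketch, using the two tables above, reports a pair of
variables as independent when the conditional probabilities equal the unconditional one (exact
equality happens to hold for the cinema table; in general a tolerance would be used):

def independent(male_hi, male_total, female_hi, female_total):
    # compare P(high spending) with P(high | male) and P(high | female)
    p = (male_hi + female_hi) / (male_total + female_total)
    return male_hi / male_total == p and female_hi / female_total == p

print(independent(175, 250, 305, 350))   # supermarket spending vs gender -> False (dependent)
print(independent(165, 220, 210, 280))   # cinema spending vs gender -> True (independent)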


Chapter 4 Probability Distributions and Expectation


In Chapter 2, we discussed a number of summary statistics which are used to present the
characteristics of a dataset. In this chapter, we are going to do a similar thing: we try to
present the characteristics of a variable, instead of a dataset.

A quantitative random variable is one in which the outcomes are expressed numerically.
Quantitative variables are classified as discrete or continuous. In this chapter, we will look at
how to present the characteristics of a discrete random variable by its probability distribution
function and expectation. We will also look at some special cases where the discrete random
variable follows a specific distribution, for example the Binomial distribution and the Poisson
distribution. The presentation of a continuous random variable will be discussed in the next
chapter.

Basically, we summarize the characteristics of discrete random variable by reviewing its:


(i) probability distribution function
(ii) expected value (mean)
(iii) variance
(iv) standard deviation

Major learning objectives of this Chapter:


➢ Be able to summarize a quantitative discrete variable by using: probability distribution
function, expected value, variance, and standard deviation
➢ Be able to calculate the expected value, variance, and standard deviation for a function of
a variable
➢ Be able to summarize a Binomial variable by using: probability distribution function,
expected value, variance, and standard deviation
➢ Be able to calculate the expected value, variance, and standard deviation for a linear
function of a Binomial variable


Probability Distribution

The probability distribution of a discrete random variable is a representation of the probabilities
for all the possible outcomes. This representation might be algebraic, graphical or tabular.

For example, a telecommunications company wants to collect information about the number of
mobile phones an adult has. It is expected that most adults would have one mobile phone,
while some may have two, three or even four mobile phones. How would we know the proportion of
adults who have one, two, three, or four mobile phones? A simple method is to conduct a survey.

Suppose a survey has been conducted. According to the discussion we had in Chapter 3, one
can use the relative frequency in the survey result to project the probability of different events.
Suppose the following table is the result of the survey, which involves a sample of 500 customers,
and X represents the number of mobile phones a customer has:

x 1 2 3 4
Frequency 240 120 80 60

As expected, the biggest group of customers would be those with one mobile phone: 240 customers
out of a total of 500. If you now randomly select one customer and talk to him, the chance that he
has one mobile phone can be projected by the relative frequency, P(X = 1) = 240/500 = 0.48.

Now try to read the following table, which represents the probability distribution of the number
of mobile phone an adult has, X:

x 1 2 3 4
P(X = x) 0.48 0.24 0.16 0.12

To present probability distribution of a discrete variable X, we need to

(i) List out all the possible outcomes of the variable, and
(ii) Find the probability of each possible outcome

A probability distribution possesses the following properties:

⚫ P(X = x) ≥ 0 for any value of x
⚫ Σ P(X = x) = 1


Example 1

Can you guess what this variable is?

x        1    2    3    4    5    6
P(X=x)   1/6  1/6  1/6  1/6  1/6  1/6

Example 2

“MovieClub” is an online platform which recruits movie lovers to share their movie watching
experience. The following is the probability distribution function of the number of visits to the
cinema made by a “MovieClub” member in a month

x 1 2 3 4 5 6
P(X=x) 0.05 0.23 0.35 0.23 0.13 0.01

What basic information can we read from the table?


(i) A “MovieClub” member would visit the cinema about 1 to 6 times in a month.
(ii) Most likely, a member would visit the cinema 3 times in a month.
(iii) More than 80% of the members would visit the cinema 2 to 4 times in a month.


Expected value of X

We already have the idea that most likely an adult would have 1 mobile phone, while an adult
can have as many as four mobile phones.

x 1 2 3 4
P(X = x) 0.48 0.24 0.16 0.12

Is there any way we can calculate the expected (average) number of mobile phones an adult has?

We learnt the concept of mean / average in Chapter 2. For a dataset, the mean is calculated
as (accumulated total) / (number of data). If we use the survey result to calculate the mean, then
among the 500 customers there are 960 mobile phones, so the average = 1.92:

[1(240) + 2(120) + 3(80) + 4(60)] / (240 + 120 + 80 + 60) = 1.92

If you take a closer look at this calculation, you will be aware that the number of customers is not
really important. If you replace the frequencies by the relative frequencies (probabilities), then the
calculation of the mean becomes:

[1(0.48) + 2(0.24) + 3(0.16) + 4(0.12)] / (0.48 + 0.24 + 0.16 + 0.12) = 1.92

This is the mean of X, or the expected value of X. The expected value of X is usually written as E(X)
and sometimes as μ. In general, for a discrete random variable X:

μ = E(X) = Σ x·P(X = x)


Variance of X

It is always more difficult to understand the concept of variance. Again, how do you interpret
the idea of the variance of X? We say that on the average an adult has 1.92 mobile phones, but does
it really mean everyone has 1.92 mobile phones? Of course not! There must be a difference
between the actual value of X and the expected value of X. Variance is the measurement of the
average squared difference of the data points from the mean:

Var(X) = Σ(x − μ)² / N

which can be simplified as

Var(X) = Σx² / N − μ²
So for our example:

Method 1
Use the data set with 500 data points: define a new variable (X − μ)², and the variance is the
average of this new variable.

x         1            2            3            4
(x − μ)²  (1 − 1.92)²  (2 − 1.92)²  (3 − 1.92)²  (4 − 1.92)²
Frequency 240          120          80           60

Var(X) = [240(1 − 1.92)² + 120(2 − 1.92)² + 80(3 − 1.92)² + 60(4 − 1.92)²] / 500 = 1.1136

Method 2
Consider X² as a function of X, calculate the values of X², and use the simplified formula to do the
calculation:

x         1     2     3     4
x²        1²    2²    3²    4²
P(X = x)  0.48  0.24  0.16  0.12

Var(X) = 1²(0.48) + 2²(0.24) + 3²(0.16) + 4²(0.12) − 1.92² = 1.1136

Var(X) = E(X²) – E(X)²

Standard deviation of X

The positive square root of the variance gives the standard deviation of X.

σ(X) = √1.1136=1.0553

➢ In summary, an adult has an average of 1.92 mobile phones with a standard deviation of
1.0553 mobile phones.


The three important formulae are:

Expectation of X: E(X) = ∑ 𝑥𝑃(𝑋 = 𝑥)

Variance of X: Var(X) = E(X2) – E(X)2

Standard deviation of X: 𝜎(𝑋) = √𝑉𝑎𝑟(𝑋)

Example 2
The following is the probability distribution function of the number of visits to the cinema by a
“MovieClub” member in a month

x 1 2 3 4 5 6
P(X=x) 0.05 0.23 0.35 0.23 0.13 0.01

What are the expectation and standard deviation of the number of visits to the cinema made by a
“MovieClub” member in a month?

Solution
E(X) = 1(0.05) + 2(0.23) + 3(0.35) + 4(0.23) + 5(0.13) + 6(0.01) = 3.19
Var(X) = 1²(0.05) + 2²(0.23) + 3²(0.35) + 4²(0.23) + 5²(0.13) + 6²(0.01) – 3.19² = 1.2339
σ(X) = √1.2339 = 1.1108

➢ On the average, a “MovieClub” member would make 3.19 visits to the cinema in a
month, with a standard deviation of 1.1108 visits.
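A minimal sketch of the three formulae, applied to the “MovieClub” distribution above:

import math

pmf = {1: 0.05, 2: 0.23, 3: 0.35, 4: 0.23, 5: 0.13, 6: 0.01}   # P(X = x)

e_x = sum(x * p for x, p in pmf.items())         # E(X) = 3.19
e_x2 = sum(x**2 * p for x, p in pmf.items())     # E(X^2)
var_x = e_x2 - e_x**2                            # Var(X) = E(X^2) - E(X)^2 = 1.2339
sd_x = math.sqrt(var_x)                          # sigma(X) = 1.1108

print(round(e_x, 4), round(var_x, 4), round(sd_x, 4))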


Function of X

In general if Y = f(X) is any function of the discrete random variable X then by substitution, the
probability distribution function of Y can be regenerated by transforming each possible value of
x to its corresponding value of y.

For example, given the probability distribution of the number of mobile phones an adult has, X:

x 1 2 3 4
P(X= x) 0.48 0.24 0.16 0.12

Suppose Y is the monthly spending on mobile service and assume Y = 150X; then the probability
distribution of the monthly spending on mobile service is as follows:

y 150 300 450 600


P(Y= y) 0.48 0.24 0.16 0.12

With the probability distribution of Y generated, the expectation, variance, and standard
deviation of Y can be calculated:
E(Y) = 150(0.48) + 300(0.24) + 450(0.16) + 600(0.12) = $288
Var(Y) = 150²(0.48) + 300²(0.24) + 450²(0.16) + 600²(0.12) − 288² = 25056
σ(Y) = √25056 = $158.29

➢ In summary, an adult spends an average of $288 for mobile service with standard
deviation of $158.29.


Manipulation of Expected Value and Variance – (I) Linear function

If X is a random variable, a and b are constants, and Y is a linear function of X,

Y = a + bX

then

E(Y) = a + b·E(X)

Var(Y) = b²·Var(X)

σ(Y) = |b|·σ(X)

As in our example, based on the probability distribution of X,


x 1 2 3 4
P(X = x) 0.48 0.24 0.16 0.12

we already found that E(X) = 1.92, Var(X) = 1.1136, σ(X) = 1.0553.

For Y = 150X:
E(Y) = 150·E(X) = 150(1.92) = $288
Var(Y) = 150²·Var(X) = 150²(1.1136) = 25056
σ(Y) = 150·σ(X) = $158.29


Manipulation of Expected Value and Variance – (II) Sum of Independent Variables
If X and Y are two independent random variables, for T = X + Y

E(T) = E(X) + E(Y)

Var(T) = Var(X) + Var(Y)

σ(T) = √(Var(X) + Var(Y))


In our example, X is the number of mobile phones an adult has, with its probability distribution
function given as:
x         1     2     3     4
P(X = x)  0.48  0.24  0.16  0.12

E(X) = 1.92, Var(X) = 1.1136, σ(X) = 1.0553

Suppose another survey is conducted and Y is the number of tablet devices an adult has, with its
probability distribution function given as:
y         0     1     2     3
P(Y = y)  0.25  0.55  0.15  0.05

E(Y) = 0(0.25) + 1(0.55) + 2(0.15) + 3(0.05) = 1
Var(Y) = 0²(0.25) + 1²(0.55) + 2²(0.15) + 3²(0.05) – 1² = 0.6
σ(Y) = √0.6 = 0.7746

If we are now interested in the total number of electronic devices (mobile phones plus tablet devices)
an adult has, then a new variable T is defined, where T = X + Y. The detailed probability
distribution function of T cannot be found easily; however, by assuming X and Y are independent,
the summary of T can easily be generated as:

E(T) = E(X) + E(Y) = 1.92 + 1 = 2.92
Var(T) = Var(X) + Var(Y) = 1.1136 + 0.6 = 1.7136
σ(T) = √(Var(X) + Var(Y)) = √1.7136 = 1.3090

➢ On the average, an adult has 2.92 electronic devices with a standard deviation of 1.3 items.
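Under the independence assumption, the distribution of T = X + Y can also be tabulated by combining
the two tables; a minimal sketch, which confirms the shortcut results above:

import math

pmf_x = {1: 0.48, 2: 0.24, 3: 0.16, 4: 0.12}   # mobile phones
pmf_y = {0: 0.25, 1: 0.55, 2: 0.15, 3: 0.05}   # tablet devices

# P(T = t) = sum over x + y = t of P(X = x) * P(Y = y), assuming independence
pmf_t = {}
for x, px in pmf_x.items():
    for y, py in pmf_y.items():
        pmf_t[x + y] = pmf_t.get(x + y, 0) + px * py

e_t = sum(t * p for t, p in pmf_t.items())                  # 2.92
var_t = sum(t**2 * p for t, p in pmf_t.items()) - e_t**2    # 1.7136
print(round(e_t, 2), round(var_t, 4), round(math.sqrt(var_t), 4))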


Binomial Distribution

In many situations, an experiment has (or can be converted to have) only two possible outcomes;
one of the outcomes is denoted as success and the other one, naturally, is denoted as failure. For
example, there is an 80% chance that a customer will spend $200 or more in one visit to the
supermarket and a 20% chance that a customer will spend less than $200 (example in Chapter 3).
When a series of identical experiments is repeatedly observed, the total number of successful
cases among the n independent identical trials is most likely our interest.

The binomial distribution is used to summarize / predict the outcome for the repeated
observations of the identical experiment.

Example 3
(a) Suppose there are 2 customers in the queue.
(i) How many of them may spend $200 or more?
(ii) What is the probability that exactly 1 of them spends $200 or more?

(b) Suppose there are 7 customers in the queue.


(i) How many of them may spend $200 or more?
(ii) What is the probability that exactly 5 of them spend $200 or more?

A variable X is defined as following a binomial distribution when
⚫ there are n independent identical trials
⚫ each trial has only two possible outcomes, namely success and failure
⚫ the probability of success is p
⚫ X measures the number of successful cases

The above information is commonly presented as

X ~ Bin(n, p)

and its probability distribution is given as:

P(X = x) = nCx (p)ˣ (1 − p)ⁿ⁻ˣ,   x = 0, 1, 2, ..., n


Example 3
(a) (i) When there are 2 customers in the queue, with X denoting the number of customers who
spend $200 or more, x can be 0, 1, or 2.

(ii) There are two different situations in which exactly one of them spends $200 or more:
P(exactly one spends $200 or more)
= P(the first one spends $200 or more and the second one spends less than $200)
  + P(the first one spends less than $200 and the second one spends $200 or more)
= 0.8(0.2) + (0.2)(0.8) = 0.32

In a Binomial sense, X ~ Bin(2, 0.8)
P(X = 1) = 2C1(0.8)(0.2) = 0.32

(Do you remember how to construct the two-level tree diagram?)

(b) (i) Suppose there are 7 customers in the queue. Denote X as the number of customers who
spend $200 or more, where x can be 0, 1, 2, 3, 4, 5, 6, or 7.

(ii) There are many ways (do you know how many?) in which exactly 5 customers spend $200 or more:

P(exactly 5 customers spend $200 or more)
= P(*****xx) + P(****x*x) + P(****xx*) + … + P(xx*****)
= (0.8)(0.8)(0.8)(0.8)(0.8)(0.2)(0.2) + (0.8)(0.8)(0.8)(0.8)(0.2)(0.8)(0.2)
  + (0.8)(0.8)(0.8)(0.8)(0.2)(0.2)(0.8) + … + (0.2)(0.2)(0.8)(0.8)(0.8)(0.8)(0.8)
= 7C5 (0.8)⁵(0.2)²
= 0.2753

In a Binomial sense, X ~ Bin(7, 0.8)

P(X = 5) = 7C5 (0.8)⁵(0.2)² = 0.2753

How to summarize information as a Binomial variable?

As in (b), the variable X is the number of customers in the queue who would spend $200 or more.
As we know that
(i) there are 7 customers in the queue, and
(ii) the chance for each customer to spend $200 or more is 0.8,
the variable X follows a Binomial distribution, where X ~ Bin(7, 0.8).

Actually, we can use this formula, P(X = x) = 7Cx (0.8)ˣ(0.2)⁷⁻ˣ for x = 0, 1, 2, …, 7, to construct
the whole probability distribution function of X as:

x 0 1 2 3 4 5 6 7
P(X = x) 0.00001 0.0004 0.0043 0.0287 0.1147 0.2753 0.3670 0.2097

➢ By reviewing the probability distribution function, it is easy to see that there is a
relatively high chance that about 5 to 7 customers would spend $200 or more in a
queue of 7 customers.
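A minimal sketch that reproduces this table, using math.comb for nCx:

from math import comb

n, p = 7, 0.8                                    # X ~ Bin(7, 0.8)

def binom_pmf(x, n, p):
    # P(X = x) = nCx * p^x * (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

for x in range(n + 1):
    print(x, round(binom_pmf(x, n, p), 4))       # x = 5 gives about 0.2753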


Expectation, Variance and Standard deviation of Binomial Distribution

What is the expectation of a binomial variable and how to calculate?

Imagine we repeatedly take every 7 customers as a group and record the number of customers
in each group who spend $200 or more. The average number of customers who spend $200
or more in a group of 7 customers is presented as its expected value.

For X ~ Bin(n, p); the expectation, variance, and standard deviation of X are as follows:

E(X) = np

Var(X) = np(1-p)

σ(X) = √𝑛𝑝(1 − 𝑝)

Example 3(b)
What are the expectation and standard deviation of number of customers spend $200 or more for
many groups of 7 customers?

Solution
As there are 7 customers in each group and the chance of spending $200 or more for each
customer is 0.8, the number of customers spend $200 or more in a group follows Binomial
distribution, X ~ Bin(7, 0.8).
E(X) = 7(0.8) = 5.6 customers
Var(X) = 7(0.8)(0.2) = 1.12
σ(X) = √1.12 = 1.0583 customers
➢ For many groups of 7 customers, on average, 5.6 out of 7 customers spend $200 or
more, with a standard deviation of 1.0583 customers.
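
The same three summary figures can be checked with a short sketch (again assuming SciPy is available):

```python
from scipy.stats import binom

n, p = 7, 0.8
print(binom.mean(n, p))  # E(X)   = np               = 5.6
print(binom.var(n, p))   # Var(X) = np(1 - p)        = 1.12
print(binom.std(n, p))   # sigma  = sqrt(np(1 - p))  ~ 1.0583
```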

Remark:
The expectation of a Binomial variable gives us some insight into the most likely number of
occurrences in a group. In our example, with 7 customers in the queue and the expected number
of customers who spend $200 or more calculated as 5.6, most likely there would be around 5 or 6
such customers. The standard deviation helps to extend this prediction to a range that covers the
outcomes with relatively high probability.


Linear Function of a Binomial Variable

Continuing the earlier discussion of linear functions of a random variable, applying a linear
function to a Binomial variable extends its application to a wider range of problems.

Example 4
Johnny has joined the training program in an elderly center. However, he does not go to the
center every day. Based on his past record, the chance that he goes to the elderly center on a
particular day is 0.75, and his attendance on different days is independent.
Every day he goes to the center, he calls the center to arrange transportation, and the
traveling fee is $15 per day.

(a) On average, how many days will he go to the elderly center in a week (Monday to Friday)?
(b) What is the probability that he will go exactly 4 days in a week?
(c) On average, how much is his traveling fee to the center in a week?

Solution
(a) Use X to denote the number of days Johnny will go to the elderly center in a week
(Monday to Friday). As there are five days in a week and the chance he will go in a day
is 0.75,
X ~ Bin(5, 0.75)
E(X) = 5(0.75) = 3.75 days
(b) P(X = 4) = 5C4(0.75)4(0.25)1 = 0.3955
(c) Use Y to denote the traveling fee in a week, Y = 15X
E(Y) = 15 E(X) = 15(3.75) = $56.25
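
Example 4 can also be restated in code; the sketch below (SciPy assumed, variable names illustrative only) simply repeats the calculation above:

```python
from scipy.stats import binom

n, p = 5, 0.75                     # Monday to Friday, P(go on a given day) = 0.75
expected_days = binom.mean(n, p)   # E(X) = np = 3.75 days
p_exactly_4 = binom.pmf(4, n, p)   # 5C4 (0.75)^4 (0.25)^1 = 0.3955

# Y = 15X is a linear function of X, so E(Y) = 15 E(X)
expected_fee = 15 * expected_days  # $56.25
print(expected_days, round(p_exactly_4, 4), expected_fee)
```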


Useful formulae

For General discrete random variable


Expectation E(X) = ∑ 𝑥𝑃(𝑋 = 𝑥)
Variance Var(X) = E(X²) − [E(X)]²
Standard deviation 𝜎(𝑋) = √𝑉𝑎𝑟(𝑋)

For Y is a linear function of X, Y= a + bX


Expectation E(Y) = a + b E(X)
Variance Var(Y) = b² Var(X)
Standard deviation σ(Y) = |b| σ(X)

For T = X + Y
Expectation E(T) = E(X) + E(Y)
Variance Var(T) = Var(X) + Var(Y)
Standard deviation σ(T) = √(Var(X) + Var(Y))

For X is a Binomial variable, X ~ Bin(n, p)


P(X = x)   P(X = x) = nCx (p)^x (1 − p)^(n−x), x = 0, 1, 2, ... , n

Expectation E(X) = np
Variance Var(X) = np(1-p)
Standard deviation σ(X) = √(np(1 − p))


Chapter 5 Normal Distribution


The normal distribution is a very important distribution. Its bell-shaped, symmetric probability
density function is a good fit for many continuous variables. In this chapter, we try to
understand the basic characteristics of a normal distribution. By knowing the mean and
variance of a normal variable, we should be able to find the probability that the variable falls
within a specified range and to locate the normal score that fulfils a specific probability
requirement.

Major learning objectives of this Chapter:


➢ Be able to read the probability density function of a continuous variable
➢ Be able to calculate the probability function of a normal variable
➢ Be able to locate the normal score of a normal variable
➢ Be able to analyse a function of a normal variable

Probability density function of continuous random variable


As you may remember from Chapter 1, a quantitative variable can be classified as a discrete
variable or a continuous variable. For a discrete variable, possible outcomes take place at a list
of separate positions. A probability distribution function, in which a specific probability
is assigned to each outcome, can be used to explain the basic characteristics of the variable.
Furthermore, the calculation of the expectation and variance can further summarize its
characteristics.

For a continuous random variable, possible outcomes take values from a continuous spectrum.
For example, think about how long a flight would be delayed compared with the expected arrival
time. The delay can be any real number.

As a continuous random variable takes values from a continuous spectrum, theoretically there
are infinitely many possible outcomes. Unlike a discrete random variable, for which a specific
probability can be assigned to each outcome, a probability density function is used to tell the
relative likelihood at a specific location. Sometimes, a graphical presentation of the probability
density function helps to review the characteristics more easily. Here are some examples of
continuous random variables:


(a) This graph indicates the time a baby needs to finish a simple task in a regular body check.
From the graph, you can see that a baby takes 1 to 5 minutes to finish the task. Unlike a
discrete random variable, there are infinitely many possibilities between 1 and 5 minutes. A
horizontal (uniform) probability density function means that every possible finishing time
between 1 and 5 minutes is equally likely.

(b) This graph indicates the time a student spends on revision in a week. This random variable
takes any value greater than 0, and the curve shows a downward (exponential-decay) pattern.
It shows that most students do not spend much time on revision.


The probability density function of a normal distribution

A continuous random variable X is defined to be a Normal random variable if its probability


density function is given by:
f(x) = (1 / (σ√(2π))) · e^(−(1/2)((x − μ)/σ)²),   where −∞ < x < ∞

where μ is the mean and σ is the standard deviation (of course σ² is the variance).

The Normal curve is symmetric and bell-shaped about a vertical line through the mean μ, and
we usually use the notation
X ~ N(μ, σ²)
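
As a small illustration, the density function above can be evaluated directly; the sketch below (plain Python, no special library; the function name is illustrative) computes f(x) for the standard normal case μ = 0, σ = 1:

```python
import math

def normal_pdf(x, mu, sigma):
    """Probability density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -0.5 * ((x - mu) / sigma) ** 2
    return coeff * math.exp(exponent)

# Peak of the standard normal curve, at x = 0
print(normal_pdf(0, mu=0, sigma=1))   # about 0.3989
```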


Revision

There are a few concepts related to the normal distribution that you should have learnt in your
previous study. Let's review them before we move on.

For a continuous random variable that follows a normal distribution with mean µ and variance σ²
(standard deviation σ), it is commonly denoted as X ~ N(µ, σ²).

1. P(X < µ) = 0.5 = P(X > µ)


2. P(µ − σ < X < µ) = 0.3413 = P(µ < X < µ + σ)   OR   P(µ − σ < X < µ + σ) = 0.6826
3. P(µ − 2σ < X < µ) = 0.4772 = P(µ < X < µ + 2σ)   OR   P(µ − 2σ < X < µ + 2σ) = 0.9544
4. The standard score for a data value is calculated as z = (x − µ)/σ


Example 1
In a cafe, the spending of a customer for a cup of coffee, X, is known to follow a normal
distribution with mean $50 and standard deviation $10, X ~ N(50, 102)

That means the spending on a cup of coffee is a variable; some customers would spend more and
some would spend less. The mean spending is known to be $50 and the standard deviation is $10.
As it is a normal distribution, we also know that

1. Half of the customers would spend more than $50.


2. 34.13% of the customers would spend $50 to $60.
3. 47.72% of the customers would spend $50 to $70.
4. when x = 50, the standard score z = (50 − 50)/10 = 0
   when x = 60, the standard score z = (60 − 50)/10 = 1
   when x = 70, the standard score z = (70 − 50)/10 = 2

[Number line: X ~ N(µ, σ²) marked at µ − 3σ, µ − 2σ, µ − σ, µ, µ + σ, µ + 2σ, µ + 3σ; X ~ N(50, 10²) marked at 20, 30, 40, 50, 60, 70, 80; Z ~ N(0, 1²) marked at −3, −2, −1, 0, 1, 2, 3]

Besides knowing the above basic information, can we do further analysis, such as
(a) What is the probability that a customer spends more than $53 for a cup of coffee?
(b) What is the value of k if 20% of the customers would spend less than $k for a cup of
coffee?


Finding probability for a normal variable

The probability that X lies between a and b is written as


P( a < X < b)

In this course, we will find the probability for a normal variable by


1. standardize the normal variable to a standard normal variable (standard score z = (x − µ)/σ)
2. look up the probability from the standard normal table

Standard normal table and standard normal distribution

The standardized normal variable follows a normal distribution with mean 0 and standard
deviation 1, which is commonly denoted as Z ~ N(0, 1²). This variable Z is in fact any normal
variable after each data point has been transformed to its standard score with the formula
Z = (X − µ)/σ, where µ is the mean and σ is the standard deviation of the original variable.

For X ~ N(µ, σ²); with Z = (X − µ)/σ; then Z ~ N(0, 1²)

As from previous study, we know that

P(µ < X < µ + σ) = P(0 < Z < 1) = 0.3413
P(µ < X < µ + 2σ) = P(0 < Z < 2) = 0.4772

Besides remembering these two probabilities:

P(0 < Z < 1) = 0.3413 and P(0 < Z < 2) = 0.4772,

there is a standard normal table that keeps the probability P(0 < Z < z) for any positive z
correct to 2 decimal places.


The entries in Table I are the probabilities that a random variable having the standard normal
distribution will take on a value between 0 and z. They are given by the area of the gray
region under the curve in the figure.

TABLE I NORMAL-CURVE AREAS

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4648 0.4656 0.4664 0.4671 0.4678 0.4685 0.4692 0.4699 0.4706
1.9 0.4713 0.4719 0.4725 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
Also, for z = 4.0, 5.0 and 6.0, the areas are 0.49997, 0.4999997, and 0.499999999.


Below are the top few rows of the standard normal table. Let's see how to use the table to read
out probabilities related to z = 0.32.
The entries in Table I are the probabilities that a random variable having the standard normal
distribution will take on a value between 0 and z. They are given by the area of the gray
region under the curve in the figure.

TABLE I NORMAL-CURVE AREAS

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224

➢ P(0 < Z < 0.32) = 0.1255 (12.55% of the data take values between 0 and 0.32)

➢ P(−0.32 < Z < 0) = 0.1255 (12.55% of the data take values between −0.32 and 0)

➢ P(Z > 0.32) = 0.3745 (37.45% of the data have values greater than 0.32)

➢ P(Z < 0.32) = 0.6255 (62.55% of the data have values less than 0.32)
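
These table lookups can be verified with a short sketch (assuming SciPy is available); norm.cdf gives P(Z < z), so P(0 < Z < z) is obtained by subtracting 0.5:

```python
from scipy.stats import norm

z = 0.32
print(norm.cdf(z) - 0.5)           # P(0 < Z < 0.32)  ~ 0.1255
print(norm.cdf(0) - norm.cdf(-z))  # P(-0.32 < Z < 0) ~ 0.1255
print(1 - norm.cdf(z))             # P(Z > 0.32)      ~ 0.3745
print(norm.cdf(z))                 # P(Z < 0.32)      ~ 0.6255
```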

How does the standard normal table help us to review the probability function of other normal
variables? For example, how can we find the probability that a customer would spend more
than $53 for a cup of coffee, supposing the spending follows a normal distribution with mean $50
and standard deviation $10?


Convert any normal distribution to standard normal distribution

We can standardize any normal variable X by subtracting the mean μ and then dividing by the
standard deviation σ. This gives the standard Normal variable Z.

Z = (X − μ) / σ

The value z, which is called the standard normal score, measures how far the data value is from
the mean, using the standard deviation as the measurement unit.

Finding probability for a normal variable

If you want to calculate the probability for a normal variable, you may follow this procedure:

1. Write down the probability statement for variable X
   P(a < X < b)
2. Write down the probability statement for variable Z:
   P((a − μ)/σ < Z < (b − μ)/σ)
3. Look up the probability from the standard normal table
   P(0 < Z < z)
4. Complete the calculation using the symmetric characteristics of the curve


Example 1
In a cafe, the spending of a customer for a cup of coffee, X, is known to follow a normal
distribution with mean $50 and standard deviation $10, X ~ N(50, 102). What is the probability
that a customer spends more than $53 for a cup of coffee?

Solution
In order to find the probability that a customer would spend more than $53 for a cup of coffee,
the variable X, the spending on a cup of coffee, is standardized with the function Z = (X − 50)/10.

[Graphs: the Normal Distribution X ~ N(50, 10²) with the area above 53 shaded, and the
Standardized Normal Distribution Z ~ N(0, 1²) with the area above 0.30 shaded (0.3821)]

P(X > 53) = P(Z > (53 − 50)/10) = P(Z > 0.3) = 0.5 − 0.1179 = 0.3821


Locating the normal score for specific probability requirement

By reversing the previous procedure, we can locate the normal score in a normal distribution that
fulfills a specific probability requirement.

1. Locate the unknown normal score (k) reasonably on the normal curve. Make sure you are
   aware of whether the normal score should be smaller or bigger than the mean.

2. Rewrite the probability statement related to the mean.


P(k < X < µ) for the normal score less than the mean
or P(µ < X < 𝑘) for the normal score bigger than the mean

3. Rewrite the probability statement for variable Z and find the value of a from the standard
normal table.
P(a < Z < 0) where a should be negative
or P(0 < Z < 𝑎) where a should be positive

4. Transform a back to k:
   k = µ + aσ

Example 1
In a cafe, the spending of a customer for a cup of coffee, X, is known to follow a normal
distribution with mean $50 and standard deviation $10, X ~ N(50, 102). The manager wants to
know what should be the value of k so that 20% of the customers would spend less than $k for a
cup of coffee.

Solution

A graph indicates 20% customers spend less than $k:

As P(X < k) = 0.2 (k should be a position less than the mean)


P(k < X < 50) = 0.5 − 0.2 = 0.3 (rewrite the statement related to the mean)
As P(-0.84 < Z < 0) = 0.3 (from the table)
k = 50 + (10)(-0.84) = 41.6
i.e. 20% of the customers would spend less than $41.6 for a cup of coffee.
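
The reverse lookup can also be done with the inverse CDF (percent-point function); the sketch below (SciPy assumed) gives a slightly more precise answer than the rounded table value z = −0.84:

```python
from scipy.stats import norm

mu, sigma = 50, 10
k = norm.ppf(0.20, loc=mu, scale=sigma)  # k such that P(X < k) = 0.20
print(k)   # about 41.58 (the table value -0.84 gives 41.6)
```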


Example 2
Instead of selling only coffee, the cafe also sells sliced cakes. It is known that the spending on
a piece of cake follows a normal distribution with mean of $32 and standard deviation of $6.

(a) What is the probability that a customer spends less than $30 for a piece of cake?
(b) If 60% customers would spend more than $k for a piece of cake, what is the value of k?

Solution
Use Y to denote the spending on a piece of cake, Y ~ N(32, 62)

(a) A graph indicates Y < 30: (b) A graph indicates 60% of the spending is
more than $k:

(a) P(Y < 30) = P(Z < (30 − 32)/6) = P(Z < −0.33) = 0.5 − 0.1293 = 0.3707

(b) P(Y > k) = 0.6
    P(k < Y < 32) = 0.6 − 0.5 = 0.1
    P((k − 32)/6 < Z < 0) = 0.1
    As P(−0.25 < Z < 0) = 0.1 from the table,
    (k − 32)/6 = −0.25
    k = 32 + (−0.25)(6) = 30.5


Function of a normal variable – (I) Linear function

Suppose variable X follows a normal distribution with known µ and σ, X ~ N(µ , σ 2).
For a variable Y which is a linear function of X and be expressed as Y = a + bX, then Y also
follows a normal distribution such that

Y ~ N(a + b µ, (b σ)2)

Example 1
In the cafe, the spending on a cup of coffee follows a normal distribution with mean of $50 and
standard deviation of $10, X ~ N(50, 10²). Suppose the owner of the café is considering adjusting
the selling price of each cup of coffee by marking up the original price by 8% and then applying a
discount of $2.
(a) What are the (i) mean and (ii) standard deviation of the selling price of a cup of coffee after
the adjustment?
(b) After the adjustment, what is the probability that someone buys a coffee that costs $54 or
more?

Solution
(a) With X as the notation of the original price of a cup of coffee and use Y to denote the adjusted
price,
Y = 1.08X – 2
(i) Mean of Y = 1.08E(X) – 2 = 1.08(50) – 2 = $52
(ii) Standard deviation of Y = 1.08 σ (X) = 1.08(10) = $10.8

(b) P(Y ≥ 54) = P(Z ≥ (54 − 52)/10.8) = P(Z ≥ 0.19) = 0.5 − 0.0753 = 0.4247
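
A quick sketch of the linear-function rule (SciPy assumed); note the small difference from 0.4247, which comes from rounding z to 0.19 when the table is used:

```python
from scipy.stats import norm

mu_x, sigma_x = 50, 10        # original price X ~ N(50, 10^2)
mu_y = 1.08 * mu_x - 2        # Y = 1.08X - 2, so E(Y) = 52
sigma_y = 1.08 * sigma_x      # sigma(Y) = 1.08 * 10 = 10.8

print(norm.sf(54, loc=mu_y, scale=sigma_y))  # P(Y >= 54) ~ 0.426
```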


Function of a normal variable – (II) Sum of two independent normal variables

The sum of two or more independent Normal variables is also Normally distributed. For two
independent Normal variables such that
X1 ~ N(μ1, σ12), and X2 ~ N(μ2, σ22), then:

𝑋1 + 𝑋2 ~𝑁(𝜇1 + 𝜇2 , 𝜎12 + 𝜎22 )

Example 3
In the cafe, the spending on a cup of coffee follows a normal distribution with mean of $50 and
standard deviation of $10, X ~ N(50, 102). It is also known that the spending on a piece of cake
follows a normal distribution with mean of $32 and standard deviation of $6, Y ~ N(32, 62).
Suppose the spending on a cup of coffee and the spending on a piece of cake are independent.
Imagine there are many customers buying one cup of coffee and one piece of cake and you want
to review the total spending of a customer:
(i) What is the distribution of the total spending?
(ii) What is the probability that a customer spends more than $80 when buying a cup of coffee
and a piece of cake?

Solution
(i) For T to be the total spending, T = X + Y
T ~ N(50 + 32, 10² + 6²);
T ~ N(82, 11.6619²)
On average, a customer spends $82 to buy a cup of coffee and a piece of cake,
with a standard deviation of $11.6619.

(ii)
A graph indicates T > 80:

So P(T > 80) = P(Z > (80 − 82)/11.6619) = P(Z > −0.17) = 0.5 + 0.0675 = 0.5675
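
The rule for the sum of independent normal variables can be sketched the same way (SciPy assumed):

```python
import math
from scipy.stats import norm

mu_t = 50 + 32                      # E(T) = E(X) + E(Y) = 82
sigma_t = math.sqrt(10**2 + 6**2)   # sqrt(Var(X) + Var(Y)) = sqrt(136) ~ 11.6619

print(norm.sf(80, loc=mu_t, scale=sigma_t))  # P(T > 80) ~ 0.568
```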


Chapter 6
Sampling Distributions and Central Limit Theorem
In Chapters 1 and 2, we discussed the concept of and difference between a census and a survey,
and how to calculate the mean as a summary of the characteristics of a data
set. In this chapter, we go further to understand the relationship between the population mean
and the sample mean, and connect them through the sampling distribution.

The objective of studying the sampling distribution is to build a foundation for the discussion of
the use of inferential statistics.

Major learning objectives of this Chapter:


➢ Understand the normal distribution characteristic of sample mean (as a variable)
➢ Understand the normal distribution characteristic of sample proportion (as a variable)

Sample mean as a random variable

Let’s use a simple example to understand the idea that the sample mean is a random variable.

In a university, all year 1 students have to take “General Statistics”.


The population mean score of all students is 71.65 and the population standard deviation is 14.29

When students are randomly assigned to different classes, each of size 30, the
average result of each class can be calculated:
Class 1: 28, 32, …, 95, 97   mean result = (28 + 32 + ⋯ + 95 + 97)/30 = 68.3
Class 2: 33, 35, …, 96, 98   mean result = (33 + 35 + ⋯ + 96 + 98)/30 = 74.4
Class 3: 30, 31, …, 88, 91   mean result = (30 + 31 + ⋯ + 88 + 91)/30 = 72.2
30

Selecting one class of students and reviewing the class mean is the same idea as selecting one
sample and looking at the sample mean. It is easy to see from the above example that the sample
mean is not unique; it is a variable.

From now on, we can consider sample mean as


(i) a data – if we just focus on one particular sample
(ii) a variable – if we record each sample mean from different samples repeatedly

If sample mean is a random variable, what are the characteristics of this random variable? Is it
discrete or continuous? What are the mean and standard deviation of this random variable?


Sampling distribution

Suppose many different samples of the same size are obtained by repeatedly sampling from a
population with population mean μ and population standard deviation σ. For each sample:
•	the sample mean x̅ is calculated; and
•	a histogram of these sample means is drawn

[Diagram: from a population variable X with mean μ and variance σ², repeated samples 1, 2, 3, … each give a sample mean x̅1, x̅2, x̅3, …]

Properties of the Sampling Distribution

The characteristics of the distribution function of the sample mean can then be summarized as
follows:

1. Mean of sample means is the same as the population mean.


𝐸(𝑋̅) = 𝜇

2. Variance of sample means equals the population variance divided by the sample size.
   Var(X̅) = σ²/n

3. The standard error, the positive square root of the variance, is a measure of the
   average deviation of an individual sample mean from the population mean.
   SE(X̅) = σ/√n

Central Limit Theorem

If the sample size is reasonably large (n ≥ 30), the sample mean distribution is well approximated
by a normal distribution. (Central Limit Theorem)
X̅ ~ N(μ, σ²/n)
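
A quick simulation illustrates these properties; the sketch below uses NumPy, and the population is an arbitrary non-normal example invented only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary, non-normal population of 100,000 scores between 0 and 100
population = rng.integers(0, 101, size=100_000).astype(float)
mu, sigma = population.mean(), population.std()

n = 30
# Draw 10,000 samples of size 30 and record each sample mean
sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)

print(mu, sample_means.mean())                  # mean of sample means ~ mu
print(sigma / np.sqrt(n), sample_means.std())   # SE of sample means ~ sigma / sqrt(n)
```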


Sample mean as a normal variable

The sample mean is considered a normal variable when either

(i) the original variable follows a normal distribution, or
(ii) the sample size is reasonably large (n ≥ 30).

With requirement (i) or (ii) (or both) fulfilled, further analysis using the characteristics of a
normal variable can be conducted.

Example 1
As mentioned earlier, suppose the examination result of General Statistics is as follow:
Population mean: 71.65
Population standard deviation: 14.29

When students are randomly assigned to classes of 30, the sampling distribution of the class
average is as follows:
Mean of class mean: 71.65
Variance of class mean: 14.29²/30 = 6.8068
Standard error of class mean: 14.29/√30 = 2.6090
Because the sample size is reasonably large, X̅ ~ N(71.65, 2.6090²)

➢ If you compare the performance between individual students, the mean score is 71.65 and
the standard deviation is 14.29. However, if you compare the performance between
different classes (using the class mean to represent the performance of the class), the average
of the class mean scores is 71.65 and the standard deviation of the class mean scores is 2.61.
It is not a surprise that comparison between classes is more stable than comparison
between students, as each class contains both well-performing and less well-performing
students; the class mean balances the high marks against the low marks.


Example 2
A report indicates that on average a tourist spends $5000 on a 3-day trip to Taiwan. The
standard deviation of the spending is $600, so the variance is 360000 ($²). Imagine you are a
tour guide and you take care of a group (sample) of 40 tourists every day. If you keep a long-
term record of the mean spending of each group of 40 tourists per day, you should be aware
that the mean spending in each group is not constant, but a variable. Use X to denote the
spending of an individual and X̅ to denote the mean spending of a sample of 40 tourists; then

Mean of sample mean spending: E(X̅) = $5000
Variance of sample mean spending: Var(X̅) = 360000/40 = 9000 ($²)
Standard error of sample mean spending: SE(X̅) = 600/√40 = $94.87
Because of the large sample size (n = 40 > 30), X̅ ~ N(5000, 94.87²)

Example 3
As in our example in Chapter 5, the spending on a cup of coffee is a normal variable, with
population mean of $50 and standard deviation of $10. If repeated random samples of size 30
are selected, then the sample mean spending 𝑋̅ is considered as a random variable where
Mean of sample mean: E(X̅) = μ = $50
Variance of sample mean: Var(X̅) = Var(X)/n = 100/30 = 3.3333
Standard error of sample mean: SE(X̅) = σ/√n = 10/√30 = $1.8257
Because the sample size 30 is large enough, X̅ ~ N(50, 1.8257²)

Remark:
For X ~ N(μ, σ²), the standard score is calculated as Z = (X − μ)/σ   (Chapter 5)
For X̅ ~ N(μ, σ²/n), the standard score is calculated as Z = (X̅ − μ)/(σ/√n)   (Chapter 6)


Sampling Distribution of Sample Proportion

When the population variable is a quantitative variable (e.g. examination result), the sample is
usually summarized by the calculation of the sample mean. When the population variable is a
qualitative variable (e.g. gender of a student), the sample is then summarized by the calculation
of the sample proportion.

Example 1

Imagine, for the same group of 2000 students taking the course “General Statistics”, there are
1500 male and 500 female students. The variable gender is a qualitative variable. Here, we use p
to denote the population proportion of males, so p = 1500/2000 = 0.75.

If a class of 30 students has 24 male and 6 female, we can use 𝑝̂ to denote the class proportion
of male (sample proportion) such that 𝑝̂ = 0.8.

Imagine now we select another class of 30 students; it is easy to realize that the proportion of
males in this class may or may not be the same as in the previous class. We have many classes of
students and each class has its own class proportion of males. Again, we should consider the
sample proportion as a random variable.

Now, try to include all possible sample proportions and review its probability density function.

When p is used to denote the given population proportion, the characteristics of the density
function of the sample proportion p̂ can be summarized as follows:

1. Mean of sample proportions is the same as the population proportion.
   E(p̂) = p
2. Variance of sample proportions:
   Var(p̂) = p(1 − p)/n
3. Standard error, the positive square root of the variance of the sample proportions:
   SE(p̂) = √(p(1 − p)/n)


Central Limit Theorem

When the sample size is reasonably large (n ≥ 30), the sample proportion distribution is well
approximated by a normal distribution. (Central Limit Theorem)

p̂ ~ N(p, p(1 − p)/n)

for n > 30, np > 5, n(1 − p) > 5

Example 1
For all year 1 students taking the course “General Statistics”, it is known that the population
proportion of male, p = 0.75.

When students are randomly assigned to classes of 30, the distribution of the proportion of
males in a class is:
Mean of class proportion of males = 0.75
Variance of class proportion of males = (0.75 × 0.25)/30 = 0.00625
Standard error of class proportion of males = √((0.75 × 0.25)/30) = 0.0791
As the sample size is reasonably large, p̂ ~ N(0.75, 0.0791²)

➢ Each class has approximately 75% males, but the figure is not fixed. The proportion of
males in a class has a standard deviation of 7.91% around the true level of 75%.


Example 4
Assume that among all customers of a jewelry shop, 40% customers are classified as “high
spending”. If random samples of size 70 are selected, and each time the sample proportion of
customers classified as “high spending” is calculated and denoted as 𝑝̂ , then

With p to denote the population proportion of “high spending” customers: p = 0.4,


𝑝̂ to denote the proportion of “high spending” customers in a sample of 70 customers

E(p̂) = 0.4
Var(p̂) = 0.4(0.6)/70 = 0.0034
SE(p̂) = √(0.4(0.6)/70) = 0.05855
As the sample size n = 70 is reasonably large, the sample proportion is normally distributed:
p̂ ~ N(0.4, 0.05855²)
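
The summary figures above follow directly from the formulas; a plain-Python sketch (the function name is illustrative only):

```python
import math

def proportion_sampling_distribution(p, n):
    """Mean, variance and standard error of the sample proportion p-hat."""
    var = p * (1 - p) / n
    return p, var, math.sqrt(var)

print(proportion_sampling_distribution(0.4, 70))   # (0.4, ~0.0034, ~0.0586)
print(proportion_sampling_distribution(0.2, 300))  # (0.2, ~0.0005, ~0.0231), as in Example 5 below
```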

Example 5

In Airline ABC, 20% of the customers book their air ticket for a business trip. A promotion
focused on this group of customers has recently been launched. With many flights, each flight
carrying 300 customers, the sample proportion of customers on a business trip is denoted as p̂; then

With p to denote the population proportion of customers having business trip: p = 0.2,
𝑝̂ to denote the proportion of customers having business trip in a sample of 300 customers

E(p̂) = 0.2
Var(p̂) = 0.2(0.8)/300 = 0.0005
SE(p̂) = √(0.2(0.8)/300) = 0.02309
As the sample size n = 300 is reasonably large, the sample proportion is normally distributed:
p̂ ~ N(0.2, 0.02309²)


Proof (Reference Reading):


You may want to know the proofs below, although they will not be examined.
The sample mean, X̅, is actually the sum of a series of independent variables.

X̅ = X1/n + X2/n + X3/n + ⋯ + Xn/n

E(X̅) = E(X1/n + X2/n + X3/n + ⋯ + Xn/n)
     = E(X1/n) + E(X2/n) + … + E(Xn/n)
     = (1/n)E(X1) + (1/n)E(X2) + … + (1/n)E(Xn)
     = μ/n + μ/n + … + μ/n
     = (μ/n) · n
     = μ

You may try to prove that Var(X̅) = σ²/n:

Var(X̅) = Var(X1/n + X2/n + X3/n + ⋯ + Xn/n)
       = Var(X1/n) + Var(X2/n) + … + Var(Xn/n)
       = (1/n²)Var(X1) + (1/n²)Var(X2) + … + (1/n²)Var(Xn)
       = σ²/n² + σ²/n² + … + σ²/n²
       = (σ²/n²) · n
       = σ²/n


Chapter 7 Estimation
One major type of inferential statistics is estimating the unknown parameter in the population by
the information collected from a sample. In this chapter, we will discuss how to estimate the
unknown population mean and population proportion.

In the previous chapter, for a continuous random variable X with population mean 𝜇 and
population standard deviation , the sample mean distribution for samples with sample size n
consists of the following characteristics:

(i) E(X̅) = μ
(ii) Var(X̅) = σ²/n
(iii) SE(X̅) = σ/√n
(iv) X̅ is normally distributed when either n ≥ 30 or X is itself normally distributed

Similarly, for a qualitative random variable X whose population proportion in favour of one
particular option is denoted by p, the sample proportion distribution for samples with sample
size n has the following characteristics:
(i) E(p̂) = p
(ii) Var(p̂) = p(1 − p)/n
(iii) SE(p̂) = √(p(1 − p)/n)
(iv) p̂ is normally distributed when n ≥ 30, np > 5, and n(1 − p) > 5

In this chapter, because of the above sampling distribution characteristics, we are going to study
the technique of estimating the population mean by using the sample mean obtained from the
survey. We will also study the sampling distribution of sample proportion and make use of it to
estimate the population proportion.

Major learning objectives of this Chapter:


➢ Be able to estimate the unknown population mean (with given population standard
deviation) by the calculation of point estimate, sampling error, and confidence interval
estimate
➢ Be able to estimate the unknown population mean (with unknown population standard
deviation) by the calculation of point estimate, sampling error, and confidence interval
estimate
➢ Be able to estimate the unknown population proportion by the calculation of point estimate,
sampling error, and confidence interval estimate

Example 1

You are asked to review the lifetime of the light bulbs produced in a factory by reporting the
population mean lifetime. Lifetime is a continuous random variable. According to the
information provided by the factory, the population mean lifetime 𝜇 is unknown while the
population standard deviation is known to be 80 hours. How can we estimate the population
mean lifetime by not doing a census but only conducting a survey with sample size n = 50?


Estimation of the Population Mean (σ is known)

(i) Point Estimation


When we do estimation, the error (bias) in the estimation is calculated as the difference between
the estimate and the true population parameter. It is suggested to use the sample mean as the
point estimate of the population mean, as if we define error = 𝑥̅ − 𝜇, the average error would
be equals to 0. The sample mean is said to be the unbiased point estimator of the population
mean.

Point estimate of population mean = 𝑥̅

Example 1
In order to estimate the population mean lifetime, a random sample of 50 light bulbs is selected.
The sample mean lifetime is calculated as 680 hours.

In this case, the point estimate of the population mean lifetime is 680 hours.

The sample mean is a point estimate of the population mean. Definitely, a certain level of error
in the estimation is expected. The problem is: can the error in the estimation be calculated?


Estimation of the Population Mean (σ is known)

(ii) Sampling Error at 100(1 - α)% confidence


As the survey sample size is less than 100% of the population size, when we use the sample mean
to estimate the population mean, there must be a certain level of error. This error is named as
sampling error, which is defined as the difference between the estimate and the population
parameter:

Error = 𝑥̅ − 𝜇

How large is this sampling error? We cannot derive the sampling error for a particular sample
as the population mean is unknown (you must remember this point). However, we can derive
the sampling error at a certain confidence level (some statisticians call this maximum sampling
error the margin of error), e.g. the 95% confidence level. In order to derive the sampling error at a
certain confidence level, we must be familiar with the sampling distribution.

Example 1
As you remember, we just mentioned the lifetime of the light bulb in a factory has the following
characteristics:
population mean 𝜇, which is unknown,
population standard deviation σ = 80 hours
In order to estimate the population mean lifetime, a random sample of 50 light bulbs is selected.
If we do not just focus on one particular sample, but consider repeatedly selecting many
samples, each with sample size n = 50, then the sample mean distribution is:

X̅ ~ N(μ, 80²/50)

As 95% of z-scores lie between (−1.96, 1.96),
95% of sample means lie between (μ − 1.96 × 80/√50, μ + 1.96 × 80/√50)

Proof:
P(−1.96 < Z < 1.96) = P(−1.96 < (X̅ − μ)/(80/√50) < 1.96)
                    = P(−1.96 × 80/√50 < X̅ − μ < 1.96 × 80/√50)
                    = P(μ − 1.96 × 80/√50 < X̅ < μ + 1.96 × 80/√50)


[Figure: standard normal curve with 2.5% in each tail beyond z = ±1.96; equivalently, X̅ beyond μ ± 1.96σ/√n]

As sampling error is defined as 𝑥̅ − 𝜇,


the sampling error at the 95% confidence level is derived as 1.96 × 80/√50, i.e. 22.1749 hours.
That means there are 95% cases the error of the estimation is less than 22.175 hours. Only 5%
cases the error of the estimation is more than 22.175 hours.

Or in general, the sampling error at 100(1- α)% confidence level is

zα/2 × σ/√n

We call 𝑧𝛼/2 the critical value, while 𝛼/2 is the upper tail area in the normal curve.
Commonly used confidence level includes:

Confidence level    Critical value zα/2
90%                 1.645
95%                 1.96
98%                 2.33
99%                 2.575

(The critical values can be easily found out from the standard normal table)

Referring to different confidence level, the sampling error would be:


Sampling error at 90% confidence = 1.645 × 80/√50 = 18.6111 hours
Sampling error at 95% confidence = 1.96 × 80/√50 = 22.1749 hours
Sampling error at 98% confidence = 2.33 × 80/√50 = 26.3609 hours
Sampling error at 99% confidence = 2.575 × 80/√50 = 29.1328 hours

➢ There is a 95% chance that the difference between the calculated sample mean and the true
population mean is no more than 22.17 hours. Only 5% chance that this difference is more
than 22.17 hours.


Estimation of the Population Mean (σ is known)

(iii) Confidence Interval Estimation


Combining the (i) point estimate and the (ii) sampling error, a confidence interval estimate can
be constructed as

(x̅ − zα/2 × σ/√n, x̅ + zα/2 × σ/√n)

Let’s take a look at how to construct the 95% confidence interval estimate. As the confidence
level is set at 95%, the sampling error is calculated as 1.96 × σ/√n. If repeated sampling is
conducted and each time an interval is calculated based on the formula

(x̅ − 1.96 × σ/√n, x̅ + 1.96 × σ/√n)

[Figure: normal curve of X̅ with 2.5% in each tail at a distance of 1.96σ/√n from μ; below it, the 95% confidence intervals constructed from samples 1 to 10, most covering μ and a few missing it]

We can see from the above diagram that most of the constructed intervals cover the true
unknown population mean, while a few do not. In fact, of all these constructed intervals, 95%
cover the true unknown population mean.

Practically, if only one random sample is selected, there is 95% chance that the constructed
confidence interval can successfully include the unknown population mean.

Example 1
As the sample mean lifetime of 50 light bulbs is 680 hours and the 95% sampling error is
calculated as 22.1749 hours, the 95% confidence interval estimate of the population mean
lifetime is:
(680 − 1.96 × 80/√50, 680 + 1.96 × 80/√50) = (657.8251, 702.1749) hours


As a summary,

The unbiased point estimate of the population mean is 𝑥̅


The sampling error with 100(1 − α)% confidence level is zα/2 × σ/√n

The 100(1 − α)% confidence interval estimate of the population mean is

(x̅ − zα/2 × σ/√n, x̅ + zα/2 × σ/√n)

Example 2
The manager of a beauty counter wants to review the spending of the customers. The population
mean spending is unknown and the population standard deviation is $180. He estimates the
population mean spending by randomly selecting 60 customers. The sample mean spending of the
selected 60 customers is $880.

(a) What is the point estimate of the population mean?


(b) What is the sampling error at 90% confidence level?
(c) What is the 90% confidence interval estimate of the population mean?

Solution
(a) The point estimate of the population mean spending is $880
(b) With σ = 180, n = 60,
    the sampling error at 90% confidence level = 1.645 × 180/√60 = $38.2263
(c) The 90% confidence interval estimate of the population mean is:
    (880 − 1.645 × 180/√60, 880 + 1.645 × 180/√60) = $(841.77, 918.23)

➢ The population mean spending is point estimated as $880 with a 90% sampling error of
$38.2263.
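
Example 2 can be reproduced with a short sketch (SciPy assumed; the function name is illustrative). Using the exact critical value 1.6449 instead of the rounded 1.645 changes the answer only slightly:

```python
import math
from scipy.stats import norm

def z_confidence_interval(x_bar, sigma, n, confidence=0.90):
    """Confidence interval for the population mean when sigma is known."""
    z = norm.ppf(1 - (1 - confidence) / 2)   # critical value z_{alpha/2}
    error = z * sigma / math.sqrt(n)         # sampling error (margin of error)
    return x_bar - error, x_bar + error, error

low, high, err = z_confidence_interval(880, 180, 60, 0.90)
print(round(err, 4), (round(low, 2), round(high, 2)))  # ~38.22, (841.78, 918.22)
```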


Estimation of the Population Mean (σ is unknown)

In the previous session, we estimate the population mean by 3 steps:

(i) the point estimate: x̅
(ii) the sampling error with 100(1 − α)% confidence level: zα/2 × σ/√n
(iii) the 100(1 − α)% confidence interval estimate: (x̅ − zα/2 × σ/√n, x̅ + zα/2 × σ/√n)
√ √𝑛

What if, in practice, the population standard deviation is unknown?

If the random variable X is normally distributed, the following statistics

z = (X̅ − μ)/(σ/√n)   follows a standard normal distribution
t = (X̅ − μ)/(s/√n)   follows a t-distribution with degrees of freedom n − 1

where σ is the population standard deviation and s is the sample standard deviation, which is
the best estimator of the unknown population standard deviation.

What’s the difference between z and t transformation?

The calculation of t-value is almost the same as the standard score z-value, but the population
standard deviation is replaced by the sample standard deviation. The sample standard deviation
is reasonably close to the population standard deviation and is a variable, which is different from
sample to sample. As a result, the t-distribution looks similar to the standard normal distribution
but with “fatter” tail. The t-distributions with different sample size are different. In fact, the
t-distribution is getting closer to the standard normal distribution by increasing the sample size.


Comparison between standardized normal distribution and t-Distribution

[Figure: the standardized normal curve compared with t-distributions (df = 13 and df = 5); all are bell-shaped and symmetric, but the t-distributions have 'fatter' tails]
The t-distribution is very similar to the standard normal distribution, but it has relatively fatter
tails. When the degrees of freedom (defined as the sample size minus 1, df = n − 1) increase,
the t-distribution becomes more similar to the standard normal distribution. The reason is that
a larger sample size makes the sample standard deviation a more accurate estimator of the
population standard deviation. It is well accepted that when the degrees of freedom are greater
than 29, the t-distribution is well approximated by the standard normal distribution.

Let’s use the standard normal table and t-table to look up the middle 95% data:

Standard normal distribution : -1.96 to 1.96


t-distribution with degrees of freedom 5 (sample size = 6): -2.571 to 2.571
t-distribution with degrees of freedom 13 (sample size = 14): -2.160 to 2.160
t-distribution with degrees of freedom larger than 29: -1.96 to 1.96


The entries in Table II are values for which the area to their right under the t distribution
with given degrees of freedom (the gray area in the figure) is equal to α.

TABLE II VALUE OF t

d.f. t0.050 t0.025 t0.010 t0.005 d.f.

1 6.314 12.706 31.821 63.657 1


2 2.920 4.303 6.965 9.925 2
3 2.353 3.182 4.541 5.841 3
4 2.132 2.776 3.747 4.604 4
5 2.015 2.571 3.365 4.032 5

6 1.943 2.447 3.143 3.707 6


7 1.895 2.365 2.998 3.499 7
8 1.860 2.306 2.896 3.355 8
9 1.833 2.262 2.821 3.250 9
10 1.812 2.228 2.764 3.169 10

11 1.796 2.201 2.718 3.106 11


12 1.782 2.179 2.681 3.055 12
13 1.771 2.160 2.650 3.012 13
14 1.761 2.145 2.624 2.977 14
15 1.753 2.131 2.602 2.947 15

16 1.746 2.120 2.583 2.921 16


17 1.740 2.110 2.567 2.898 17
18 1.734 2.101 2.552 2.878 18
19 1.729 2.093 2.539 2.861 19
20 1.725 2.086 2.528 2.845 20

21 1.721 2.080 2.518 2.831 21


22 1.717 2.074 2.508 2.819 22
23 1.714 2.069 2.500 2.807 23
24 1.711 2.064 2.492 2.797 24
25 1.708 2.060 2.485 2.787 25

26 1.706 2.056 2.479 2.779 26


27 1.703 2.052 2.473 2.771 27
28 1.701 2.048 2.467 2.763 28
29 1.699 2.045 2.462 2.756 29
Inf. 1.645 1.960 2.326 2.576 Inf.


By using t-distribution as a replacement of the standard normal distribution, now we can estimate
the population mean with the 3 steps procedure:

The unbiased point estimate of the population mean is 𝑥̅


The sampling error with 100(1 − α)% confidence level is tα/2 × s/√n

The 100(1 − α)% confidence interval estimate of the population mean is

(x̅ − tα/2 × s/√n, x̅ + tα/2 × s/√n)

where 𝑡𝛼/2 is the critical value with 𝛼/2 as upper tail area and n-1 as the degrees of freedom.

Remark:
The t-distribution is developed with the assumption that the random variable X follows a normal
distribution. Practically, we can use the t-distribution to estimate the population mean when the
sample size is large enough (n > 30).

Example 3
In order to estimate the population mean age of patients of a dentist, a random sample of 20
patients is selected. The sample mean age is 37.4 and the sample standard deviation is 7.8.
Assume that the age of all patients follow a normal distribution.

(a) What is the point estimate of the population mean?


(b) What is the sampling error at 90% confidence level?
(c) What is the 90% confidence interval estimate of the population mean?

Solution
With 𝑥̅ = 37.4, s = 7.8, n = 20, d.f. = 19, t19, 0.05 = 1.729
(a) point estimate of population mean age is 37.4
(b) 90% sampling error is 1.729 × 7.8/√20 = 3.0156
(c) 90% C.I. of the population mean is (37.4 − 1.729 × 7.8/√20, 37.4 + 1.729 × 7.8/√20)
    = (34.3844, 40.4156)

➢ The population mean age of all patients is point estimated as 37.4 with the 90% sampling
error of 3.0156.
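
The t-based interval in Example 3 can be reproduced in the same way (SciPy assumed; the function name is illustrative):

```python
import math
from scipy.stats import t

def t_confidence_interval(x_bar, s, n, confidence=0.90):
    """Confidence interval for the population mean when sigma is unknown."""
    t_crit = t.ppf(1 - (1 - confidence) / 2, df=n - 1)  # t_{alpha/2, n-1}
    error = t_crit * s / math.sqrt(n)
    return x_bar - error, x_bar + error, error

low, high, err = t_confidence_interval(37.4, 7.8, 20, 0.90)
print(round(err, 4), (round(low, 4), round(high, 4)))  # ~3.016, (34.38, 40.42)
```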


Estimation of the Population Proportion

Besides the estimation of the population mean, another commonly estimated parameter is the
population proportion.

Very often, we are interested in knowing the proportion of people in favour of a particular option.
For example, what proportion of residents would support “Alex” to be the next president?
What proportion of people prefer the new flavour of green-tea ice cream compared with the
chocolate ice cream? What proportion of tourists would like to go to Japan as the destination of
their next vacation?

Similar to the estimation of the population mean by sample mean, we are going to use the sample
proportion as a point estimate of the population proportion. Before that, we need to review the
sampling distribution of sample proportion as in Chapter 6.

Revision: Sampling distribution of sample proportions

Suppose we start with a population with the population proportion equal to p. When random
samples of the same size are repeatedly drawn from this population, the sample proportion can
be viewed as a random variable and the sampling distribution of the sample proportion is:

p̂ ~ N(p, p(1 − p)/n)

for n > 30, np > 5, n(1 − p) > 5.

p is the notation of the unknown population proportion, while 𝑝̂ is the sample proportion.


What if the population proportion p is unknown?

We are going to develop the 3 steps estimation of the population proportion by using the similar
approach as the estimation of the population mean.

[Figure: standard normal curve with 2.5% in each tail at z = ±1.96; equivalently, p̂ beyond p ± 1.96√(p(1 − p)/n)]
In general,
The unbiased point estimate of the population proportion is p̂
The sampling error with 100(1 − α)% confidence level is zα/2 √(p̂(1 − p̂)/n)
The 100(1 − α)% confidence interval estimate of the population proportion is

(p̂ − zα/2 √(p̂(1 − p̂)/n), p̂ + zα/2 √(p̂(1 − p̂)/n))

Example 4

Before the election, an organization has conducted a survey to investigate the supportive rate of
each candidate. The survey randomly interviewed 500 qualified voters, among them, 245
indicated that they would vote for Alex in the coming election.

(a) What is the point estimate of the population proportion of voters who support Alex?
(b) What is the sampling error at 95% confidence level?
(c) What is the 95% confidence interval estimate of the population proportion of voters who
support Alex?

Solution
Use p to denote the population proportion of voters who support Alex
(a) Point estimate of p = 245/500 = 0.49
(b) 95% sampling error = 1.96 √(0.49(0.51)/500) = 0.04382
(c) 95% C.I. of p
    = (0.49 − 1.96 √(0.49(0.51)/500), 0.49 + 1.96 √(0.49(0.51)/500))
    = (0.4462, 0.5338)

➢ The population proportion of voters who vote for Alex is point estimated as 49% with the
95% sampling error of 4.38%.
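
Example 4 can also be checked in code (SciPy assumed; the helper name is illustrative):

```python
import math
from scipy.stats import norm

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Confidence interval for a population proportion."""
    p_hat = successes / n
    z = norm.ppf(1 - (1 - confidence) / 2)
    error = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, error, (p_hat - error, p_hat + error)

print(proportion_confidence_interval(245, 500, 0.95))
# (0.49, ~0.0438, (0.4462, 0.5338))
```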


Useful formulae

Estimation of Population Mean (σ is known)

Point estimate: x̅
Sampling error at 100(1 − α)% confidence level: zα/2 × σ/√n
100(1 − α)% confidence interval estimate: (x̅ − zα/2 × σ/√n, x̅ + zα/2 × σ/√n)

Estimation of Population Mean (σ is unknown)

Point estimate: x̅
Sampling error at 100(1 − α)% confidence level: tα/2 × s/√n
100(1 − α)% confidence interval estimate: (x̅ − tα/2 × s/√n, x̅ + tα/2 × s/√n)

Estimation of Population Proportion

Point estimate: p̂
Sampling error at 100(1 − α)% confidence level: zα/2 √(p̂(1 − p̂)/n)
100(1 − α)% confidence interval estimate: (p̂ − zα/2 √(p̂(1 − p̂)/n), p̂ + zα/2 √(p̂(1 − p̂)/n))


Chapter 8 Hypothesis Testing


In the previous chapter, we looked at the type of inferential statistics in which we estimate the
unknown population parameter from the data collected in a sample. In this chapter, we look at
another type of inferential statistics, in which an assumption about the population parameter is
tested against the information provided by a sample.

Why do we have to do hypothesis testing? It is because we have an assumption about the
population parameter and we are not quite sure whether it is true or not. However, we usually
face the limitation that doing a census is practically impossible. In such a case, we can only use
the information collected from a survey to test whether the assumption is likely to be correct or
likely to be wrong (taking the sampling error into consideration).

The general logic of hypothesis testing is:

1. Identify the variable of interest and the null hypothesis of the test
2. If your hypothesis is correct, what do you expect to be observed from the sampled data?
3. Collect data through a survey
4. Does the collected data support your null hypothesis?

Major learning objectives of this Chapter:


➢ Understand the logic and commonly used terms when doing a hypothesis test
➢ To be able to conduct hypothesis testing for the following cases:
•	z-test for a mean (with known σ)
•	t-test for a mean (with unknown σ)
•	z-test for a proportion
•	t-test for the difference between two means: dependent samples
•	z-test for the difference between two means: independent samples
•	t-test for the difference between two means: independent samples
•	z-test for the difference between two proportions


Example 1

You are asked to check if the average sales per invoice this year is significantly different from
that of last year. Suppose in last year, the sales amount follows a normal distribution with
population mean $5000 and population standard deviation $1100. It is reasonable to assume
the sales amount this year follows a normal distribution with the same standard deviation as in
the previous year, however, whether the mean level has been changed significantly is not sure.

1. Use μ to denote this year’s average sales per invoice.
   If the sales level this year is the same as last year, that means μ = 5000 (null hypothesis).

2. If the average amount of sales per invoice this year is $5000, when a survey is conducted, the
sample mean should be close to $5000 (with reasonable deviation due to sampling error).

3. Suppose a random sample of 30 invoices is collected and the sample mean is $5530.

4. Do we have strong evidence to reject the null hypothesis of μ = 5000 because the difference
   between $5530 and $5000 is considered to be large? Or is the difference between
   $5530 and $5000 considered to be small, so that we do not have evidence to reject the null
   hypothesis?

In order to do the test statistically, we need to be familiar with the concepts of the sampling
distribution, the sampling error and the Normal distribution.

Now, let’s try to understand the following concepts and present our test in a statistical approach.

Important concepts relate to hypothesis testing


(i) Null hypothesis and alternative hypothesis
(ii) Two-tailed test and one-tailed test
(iii) Type I error and type II error
(iv) Level of significance and rejection region


(i) Null Hypothesis and Alternative Hypothesis

The first step when we do the hypothesis testing is to list out the hypothesis! Actually there
should be two hypotheses.

Null Hypothesis, H0, is the statement that contains the assumption about the population (the equal
sign “=” is always included).

Alternative Hypothesis, H1, is the statement that we want to test against the null hypothesis (the
equal sign “=” should not be included).

Example

In our example, regarding to the average amount of each sales invoice in this year,
H0: μ = $5000 v.s. H1: μ ≠ $5000

(ii) Two-tailed test and one-tailed test


Be careful, the alternative hypothesis can be two-sided or one-sided depending on what we try to
prove.

Example

In our example, as the test statement is


(a) You are asked to check if the average sales per invoice this year is significantly different from
last year.
As either an increase or a decrease in the average violates the null hypothesis, this test is named
a two-tailed test with H1: μ ≠ $5000

However, if the test statement is changed as


(b) You are asked to check if the average sales per invoice this year is significantly higher than
last year.
Only an increase in the average violates the null hypothesis, so this test is named a one-
tailed test with H1: μ > $5000

Similarly, if the test statement is changed as


(c) You are asked to check if the average sales per invoice this year is significantly less than last
year.
Only a decrease in the average violates the null hypothesis, so this test is named a one-
tailed test with H1: μ < $5000

That means, choose your set of hypotheses from below:


(a) H0: μ = $5000 v.s. H1: μ ≠ $5000 two-tailed test
(b) H0: μ = $5000 v.s. H1: μ > $5000 one-tailed test
(c) H0: μ = $5000 v.s. H1: μ < $5000 one-tailed test


Revision: sampling distribution as in Chapter 6

For a continuous random variable X with population mean μ and population standard deviation
σ, the sample mean distribution for samples with sample size n has the following
characteristics:

(i) E(X̅) = μ
(ii) Var(X̅) = σ²/n
(iii) SE(X̅) = σ/√n
(iv) X̅ is normally distributed when either n ≥ 30 or X is itself normally distributed

In example 1, we have σ = $1100. If the null hypothesis of µ = $5000 is correct, when we select
a sample with sample size n = 30, we should expect the sample mean is reasonably close to $5000.
When the observed sample mean is significantly different from $5000, so that the
probability of this happening (under the null hypothesis) is very small, we have a reason to reject
the null hypothesis statistically. In the critical value approach, we need to set up rejection region(s)
and a non-rejection region in order to help determine whether the null hypothesis should be rejected.

This is the graph for a two-tailed test. Later, we will discuss the one-tailed test.


(iii) Type I and Type II errors

No matter whether it is a two-sided test or one-sided test, we need to make our decision “Can the
null hypothesis be rejected?” based on the summary statistics we compiled from a sample. We
either:

1) No strong evidence: do not reject H0 (H0 is likely to be correct); or

2) Strong evidence: reject H0 (H0 is likely to be wrong).

As sampling error exists, sometimes we may make a wrong decision.


There are two types of errors.

                     H0 is true                       H0 is false
Do not reject H0     Correct decision                 Type II error (Probability = β)
Reject H0            Type I error (Probability = α)   Correct decision
Type I error: The null hypothesis H0 is correct, but as a very extreme sample is obtained
which indicate a violation of the null hypothesis, so the null hypothesis is
rejected.

Type II error: The null hypothesis H0 is wrong, but as a sample very close to the null
hypothesis is obtained that the null hypothesis is perceived as true, so the null
hypothesis is not rejected.

(iv) Level of significance and rejection region


The probability of committing Type I error, α, also known as the level of significance of the test,
is decided before the test is conducted. This level of significance, with the knowledge of the
distribution of the test statistics, helps to determine the rejection region(s) of the test. Common
practice is to control the level of significance at a reasonably low level, for example, 10%, 5%,
2% or 1%.

For example, if the test statistics come from the standard normal distribution, for a two tailed test
with 5% significance level, the rejection region is when z > 1.96 or z < -1.96


(v) Test Statistics

The test statistic is a function of the data collected in the sample. Instead of directly comparing
the sample mean to the assumption, it is more convenient to convert the normal distribution
to the standard normal distribution.

Example
As, in our case, we want to test the population mean μ = 5000, a z-score is calculated by

Z = (X̅ − 5000) / (1100/√30)

So when X̅ ~ N(5000, 1100²/30) is true, Z ~ N(0, 1²) is true and we should expect the z-score
calculated from the sample to be close to 0.

➢ When the sample mean gives a calculated z-score reasonably close to 0, the null
hypothesis is not rejected and is considered to be correct. Otherwise, the null hypothesis
is rejected and is considered to be wrong.
➢ At the 5% significance level, the null hypothesis is rejected when the z-score < −1.96 or the
z-score > 1.96 for a two-tailed test.

General Procedures for Hypothesis Testing (Critical value approach)

Combining all the concepts, we can present our test statistically by the following steps:

1. Define the null hypothesis and the alternative hypothesis (2-tailed or 1 tailed)
2. Define the rejection region(s) (based on the level of significance and whether it is a 2-tailed
or 1-tailed test)
3. Compile the test statistics
4. Make conclusion: (i) H0 is rejected or (ii) H0 is not rejected

The above procedure does not only apply to testing a population mean; it can also be applied to
different situations. In this chapter, we will look at many types of hypothesis testing.


I. z-test: Hypothesis for the Mean (σ known)

When the variable X is a quantitative variable, the null hypothesis states that the
population mean of X equals a particular value, μ0. The sample mean collected from a sample /
experiment will be used to conduct the test.

Step 1: Set up null hypothesis and alternative hypothesis as follow:

a. H0: μ = μ0 v.s. H1: μ ≠ μ0 (Two-tailed test)


b. H0: μ = μ0 v.s. H1: μ > μ0 (One-tailed test)
c. H0: μ = μ0 v.s. H1: μ < μ0 (One-tailed test)

81
Applied Statistics

Step 2: Set up rejection region(s) as:

a. H0: μ = μ0 v.s. H1: μ ≠ μ0 (Two-tailed test)

   Reject the null hypothesis when either (i) z is too large (z > zα/2) or (ii) z is too small (z < -zα/2).
   [Figure: standard normal curve with rejection regions of total probability α, each tail having probability α/2]

b. H0: μ = μ0 v.s. H1: μ > μ0 (One-tailed test)

   Reject the null hypothesis when z is too large (z > zα).
   [Figure: standard normal curve with a rejection region of probability α in the right tail]

c. H0: μ = μ0 v.s. H1: μ < μ0 (One-tailed test)

   Reject the null hypothesis when z is too small (z < -zα).
   [Figure: standard normal curve with a rejection region of probability α in the left tail]

82
Applied Statistics

Step 3: Calculate the z-statistic  z = (x̄ − μ0) / (σ/√n)

When X is normally distributed or when the sample size is large enough, X̄ ~ N(μ, σ²/n).
When the null hypothesis is true, the calculated z-statistic should follow the standard normal
distribution and is likely to result in a value close to 0.

Step 4: When the z-score falls into the rejection region, the null hypothesis is rejected,
otherwise the null hypothesis is not rejected.

83
Applied Statistics

Example 1
You are asked to check if the average sales per invoice this year is significantly different from
that of last year. Suppose that last year the sales amount followed a normal distribution with
population mean $5000 and population standard deviation $1100. It is reasonable to assume
the sales amount this year follows a normal distribution with the same standard deviation as in
the previous year. A sample of 30 invoices from this year gives a sample mean of $5530.
Test, at the 5% level of significance, if the population mean this year is different from last year.

Solution
Denote X as the sales (quantitative variable); a z-test is conducted to test whether the population
mean is different from last year ($5000).

Step 1. H0: μ = $5000 v.s. H1: μ ≠ $5000

Step 2. Reject H0 when z < -1.96 or when z > 1.96 (z0.025 = 1.96)
Step 3. z = (5530 − 5000) / (1100/√30) = 2.6390

Step 4. As 2.6390 > 1.96, the null hypothesis is rejected.

Conclusion: There is sufficient evidence to conclude that the average sales per invoice this year
is different from last year.
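
Remark (reference reading): for readers who want to check the arithmetic by software, here is a minimal Python sketch of the same two-tailed z-test. It assumes the scipy package is available, which is not part of the course materials (the notes use tables, the calculator, and Excel).

    from scipy.stats import norm

    mu0, sigma, n, xbar, alpha = 5000, 1100, 30, 5530, 0.05

    z = (xbar - mu0) / (sigma / n ** 0.5)   # test statistic, about 2.639
    z_crit = norm.ppf(1 - alpha / 2)        # two-tailed critical value, about 1.96
    reject_h0 = abs(z) > z_crit             # True, so H0 is rejected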

84
Applied Statistics

Hypothesis testing: critical value approach v.s. p-value approach

Critical value approach


The testing procedure we are using is called the critical value approach (or classical approach). By
assuming the null hypothesis is correct and the sampling distribution of the test statistic is known,
the rejection region(s) is set up. When the observed test statistic is more extreme than the critical
value, the null hypothesis is likely to be wrong and so is rejected.

1. H0: μ = $5000 v.s. H1: μ ≠ $5000


2. Reject H0 when z < -1.96 or when z > 1.96 (z0.025 = 1.96)
3. z = (5530 − 5000) / (1100/√30) = 2.6390

4. As 2.6390 > 1.96, the null hypothesis is rejected.

p-value approach
Another testing procedure for hypothesis testing is called the p-value approach. By assuming
the null hypothesis is correct and the sampling distribution of the test statistic is known, the
chance of obtaining a test statistic more extreme than the one observed is compiled as the p-value.
If the p-value is smaller than the significance level, the null hypothesis is rejected; otherwise, if
the p-value is greater than or equal to the significance level, the null hypothesis is not rejected.

(For a two-tailed test, the p-value is 2 P(Z > |z|). If the p-value is small, the test statistic would
fall in the rejection region as in the critical value approach.)

Redo the test in the Example on p.82:

1. H0: μ = $5000 v.s. H1: μ ≠ $5000

2. z = (5530 − 5000) / (1100/√30) = 2.6390,

   p-value = P(Z > 2.6390) + P(Z < -2.6390) = 0.0082

3. The significance level = 0.05


4. As p-value 0.0082 < 0.05, the null hypothesis is rejected at 5% significance level
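
Remark (reference reading): a minimal Python sketch of the p-value calculation above, assuming scipy is available. The small difference from 0.0082 comes from rounding in the normal table.

    from scipy.stats import norm

    z = 2.6390
    p_value = 2 * (1 - norm.cdf(abs(z)))    # two-tailed p-value, about 0.0083
    reject_h0 = p_value < 0.05              # True, so H0 is rejected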

85
Applied Statistics

General Procedures for Hypothesis Testing (p-value approach)

1. Define the null hypothesis and the alternative hypothesis (2-tailed or 1-tailed)
2. Compile the test statistics
3. Find the p-value
4. Make conclusion: (i) H0 is rejected or (ii) H0 is not rejected

Finding p-value

The p-value of the test statistic indicates how extreme the observation is. For the three
sets of alternative hypotheses, there are three corresponding methods to find the p-value.

a. H0: μ = μ0 v.s. H1: μ ≠ μ0 (Two-tailed test)

p-value is 2 P(Z > |z|)

b. H0: μ = μ0 v.s. H1: μ > μ0 (One-tailed test)

p-value is P(Z > z)

c. H0: μ = μ0 v.s. H1: μ < μ0 (One-tailed test)

p-value is P(Z < z)

Making conclusion

If the p-value is smaller than the significance level, the null hypothesis is rejected.
(The test statistics would fall in the rejection region as in the critical value approach)

If the p-value is greater than or equal to the significance level, the null hypothesis is not rejected.
(The test statistics would fall in the non-rejection region as in the critical value approach)

86
Applied Statistics

II. t-test: Hypothesis for the Mean (σ unknown)

Practically, when we work out the hypothesis testing for the population mean, we may not know
the population standard deviation. In this case, the sample standard deviation is used to replace the
population standard deviation when calculating the test statistic, and the normal distribution is
replaced by the t distribution when checking the rejection region.

Step 1: The null hypothesis and alternative hypothesis are set up similarly as in test I.

a. H0: μ = μ0 v.s. H1: μ ≠ μ0 (Two-tailed test)


b. H0: μ = μ0 v.s. H1: μ > μ0 (One-tailed test)
c. H0: μ = μ0 v.s. H1: μ < μ0 (One-tailed test)

Step 2: Set up rejection region(s) from t-distribution with degree of freedom = n - 1 as:

Test Rejection Region


either t is too large (t > tα/2) or
a. H0: μ = μ0 v.s. H1: μ ≠ μ0
t is too small (t < -tα/2).
b. H0: μ = μ0 v.s. H1: μ > μ0 t is too large (t > tα).
c. H0: μ = μ0 v.s. H1: μ < μ0 t is too small (t < -tα).

Step 3: Calculate the t-statistic  t = (x̄ − μ0) / (s/√n)

If variable X is normally distributed and the unknown population standard deviation is
estimated by the sample standard deviation, then when the null hypothesis H0: μ =
μ0 is true, the above t-statistic follows the t distribution with n − 1 degrees of freedom
and is expected to be close to 0.

Step 4: When the t statistics falls into the rejection region, the null hypothesis is rejected,
otherwise the null hypothesis is not rejected.

Remark:
If the t-test is presented in the p-value approach, the p-value should be found from the t-distribution
with the corresponding degrees of freedom instead of from the normal table.

87
Applied Statistics

Example 2
The manufacturer claims that the volume of soft drink in a bottle follows a normal distribution
with mean 2 liters. A sample of 20 two-liter bottles is selected and found to have a sample
mean of 1.98 liters and standard deviation of 0.18 liters. According to the survey result, is there
evidence that the population mean amount of soft drink filled is less than 2.0 liters at the 0.05
level of significance?

Solution
Denote X as the volume of soft drink in a bottle (quantitative variable); the test is about whether
the population mean is less than 2.0 liters.

Step 1. H0: μ = 2 v.s. H1: μ < 2

Step 2. With unknown population standard deviation and sample size n = 20, i.e. d.f. = 19
Reject the null hypothesis when t < -1.729 (t19, 0.05 = 1.729)
Step 3. t = (1.98 − 2) / (0.18/√20) = −0.4969

Step 4. As t = −0.4969 > −1.729, the null hypothesis is not rejected.

Conclusion: There is no evidence to say that the population mean amount of soft drink per bottle
is less than 2.0 liters.
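
Remark (reference reading): a minimal Python sketch of the same one-sample t-test from the summary statistics, assuming scipy is available. (If the raw data were available, scipy.stats.ttest_1samp could be used instead.)

    from scipy.stats import t

    mu0, n, xbar, s, alpha = 2.0, 20, 1.98, 0.18, 0.05
    df = n - 1

    t_stat = (xbar - mu0) / (s / n ** 0.5)  # about -0.4969
    t_crit = -t.ppf(1 - alpha, df)          # about -1.729 (left-tailed test)
    reject_h0 = t_stat < t_crit             # False, so H0 is not rejected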

88
Applied Statistics

III. z-test: Hypothesis for the Proportion

When the variable X is a qualitative variable, the null hypothesis may involve the statement
that the proportion of one option equals a particular value. For example, you may want
to test whether a coin is fair by assuming the proportion of heads obtained in the long run equals
0.5. The sample proportion collected from a survey / experiment would be used to conduct the test.
Again, three forms of alternative hypothesis may result:

Step 1: Set up null hypothesis and alternative hypothesis as follow:

a. H0: p = p0 v.s. H1: p ≠ p0 (Two-tailed test)


b. H0: p = p0 v.s. H1: p > p0 (One-tailed test)
c. H0: p = p0 v.s. H1: p < p0 (One-tailed test)

Step 2. Set up rejection region(s) from standard normal distribution as:

Test Rejection Region


a. H0: p = p0 v.s. H1: p ≠ p0 either z is too large (z > zα/2) or
z is too small (z < -zα/2)
b. H0: p = p0 v.s. H1: p > p0 z is too large (z > zα)
c. H0: p = p0 v.s. H1: p < p0 z is too small (z < -zα)

Step 3: Calculate the z-statistic as  z = (p̂ − p0) / √(p0(1 − p0)/n)

As p̂ ~ N(p, p(1 − p)/n), when the null hypothesis H0: p = p0 is true, the above
z-statistic should follow the standard normal distribution and is likely to result in a value
close to 0.

Step 4. When the z-score falls into the rejection region, the null hypothesis is rejected,
otherwise the null hypothesis is not rejected.

89
Applied Statistics

Example 3
A coin is suspected of not being fair. This single coin is tossed 200 times and 120 heads are obtained.
Test, at the 0.10 level of significance, whether the coin is fair.

Solution
Let p be the population proportion of head. When the coin is fair, p = 0.5.

Step 1. H0: p = 0.5 v.s. H1: p ≠ 0.5

Step 2. H0 is rejected if z < -1.645 or z > 1.645 (z0.05 = 1.645)


Step 3. z = (120/200 − 0.5) / √((0.5)(1 − 0.5)/200) = 2.8284

Step 4. As 2.8284 > 1.645, the null hypothesis is rejected.

Conclusion: As the hypothesis that the coin is fair is rejected, the coin is concluded to be unfair.
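
Remark (reference reading): a minimal Python sketch of the same one-proportion z-test, assuming scipy is available.

    from scipy.stats import norm

    p0, n, x, alpha = 0.5, 200, 120, 0.10
    p_hat = x / n

    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5   # about 2.8284
    z_crit = norm.ppf(1 - alpha / 2)                # about 1.645
    reject_h0 = abs(z) > z_crit                     # True, so H0 is rejected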

90
Applied Statistics

In the coming sections, we look at tests involving two populations.

IV. t-test: Hypothesis for the Difference between Two Means (Dependent)

Sometimes, two measurements would be recorded from the same subject and a comparison
between the two measurements is required. For example, every student has to take two
assessments, the mid-term examination and the final examination; or the blood pressure of each
patient before and after taking the medicine is recorded. In this case, the difference D for each
pair of dependent measurements is compiled, and the test of any significant difference between the
two populations would have the null hypothesis H0: µD = 0, which converts it into a one-population
test. In the coming example, student performance in the mid-term examination and the final
examination will be compared so that the hypothesis that students perform better in the final
examination can be tested.

Step 1: Set up null hypothesis and alternative hypothesis as follow:

a. H0: μD = 0 v.s. H1: μD ≠ 0 (Two-tailed test)

b. H0: μD = 0 v.s. H1: μD > 0 (One-tailed test)
c. H0: μD = 0 v.s. H1: μD < 0 (One-tailed test)

Step 2: Set up rejection region(s) from t-distribution with degree of freedom = n – 1 as:
Test                                          Rejection Region
a. H0: μD = 0 v.s. H1: μD ≠ 0                 either t is too large (t > tα/2) or t is too small (t < -tα/2).
b. H0: μD = 0 v.s. H1: μD > 0                 t is too large (t > tα).
c. H0: μD = 0 v.s. H1: μD < 0                 t is too small (t < -tα).

Step 3: Calculate the t-statistic as  t = x̄D / (sD/√n)

When the null hypothesis is true, the above t-statistic would follow the t-distribution
with degrees of freedom n − 1.

Step 4: When the t statistics falls into the rejection region, the null hypothesis is rejected,
otherwise the null hypothesis is not rejected.

Remark:
Define D clearly and keep it consistent for the whole test.

91
Applied Statistics

Example 4
The following are the results for a sample of 8 students from a school. Test at the 0.05 level of
significance if students perform better in the end-of-term examination.

Mid-term examination 52 58 63 78 61 70 82 74
End-of-term examination 58 55 69 79 64 68 90 77

Solution
The test is about whether students perform better in the end-of-term examination.
Define D = End-of-term examination − Mid-term examination

d: 6 -3 6 1 3 -2 8 3

Step 1. H0: µD = 0 v.s. H1: µD > 0

Step 2. There are 8 students in the sample, d.f. of the t-test is 8 – 1 = 7


Reject H0 if t >1.895 (t7, 0.05)

Step 3. t = 2.75 / (3.9188/√8) = 1.9848

Step 4. As 1.9848 > 1.895, the null hypothesis is rejected at the 5% significance level.

Conclusion: There is sufficient evidence to conclude that students perform better in the end of
term examination.
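
Remark (reference reading): a minimal Python sketch of the same paired t-test, assuming numpy and scipy are available. scipy.stats.ttest_rel(end_term, mid_term) returns the same t statistic (with a two-tailed p-value).

    import numpy as np
    from scipy.stats import t

    mid_term = np.array([52, 58, 63, 78, 61, 70, 82, 74])
    end_term = np.array([58, 55, 69, 79, 64, 68, 90, 77])
    d = end_term - mid_term                 # D = end-of-term - mid-term

    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))  # about 1.9848
    t_crit = t.ppf(0.95, len(d) - 1)                       # about 1.895
    reject_h0 = t_stat > t_crit                            # True, so H0 is rejected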

92
Applied Statistics

V. z-Test: Hypothesis for the Difference between Means (Independent)

Suppose there are two independent populations (for example, male v.s. female, or new production
line v.s. old production line). We may need to test if the population means (as the variable is
quantitative) of these two independent populations are the same. A typical example is to compare
the spending power of male customers and female customers. In such a case, independent
samples would be selected separately from the two populations and a comparison of the sample
means would be made with consideration of the possible sampling errors.

Step 1: The null hypothesis and alternative hypothesis are set up as:

a. H0: 1 −  2 = 0 v.s. H1: 1 −  2  0 (Two-tailed test)


b. H0: 1 −  2 = 0 v.s. H1: 1 −  2  0 (One-tailed test)
c. H0: 1 −  2 = 0 v.s. H1: 1 −  2  0 (One-tailed test)

Step 2: Set up rejection region(s) from standard normal distribution as:


Test                                                   Rejection Region
H0: μ1 − μ2 = 0 v.s. H1: μ1 − μ2 ≠ 0                   either z is too large (z > zα/2) or z is too small (z < -zα/2)
H0: μ1 − μ2 = 0 v.s. H1: μ1 − μ2 > 0                   z is too large (z > zα)
H0: μ1 − μ2 = 0 v.s. H1: μ1 − μ2 < 0                   z is too small (z < -zα)

Step 3: Calculate the z-statistic as  z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)

If the two independent samples are large enough, or if the two populations follow
independent normal distributions, then the two sample means follow independent
normal distributions,

X̄1 ~ N(μ1, σ1²/n1)   and   X̄2 ~ N(μ2, σ2²/n2)

When many possible pairwise comparisons between a sample from population 1 and
a sample from population 2 are made, X̄1 − X̄2 ~ N(μ1 − μ2, σ1²/n1 + σ2²/n2).
If the null hypothesis is true, the above z-statistic should follow the standard normal
distribution.

Step 4: When the z-score falls into the rejection region, the null hypothesis is rejected,
otherwise the null hypothesis is not rejected.

93
Applied Statistics

Example 5
The manager of a supermarket wants to find evidence to support the assumption that the average
spending made by female customers is significantly more than that made by male customers. It is
assumed that the spending made by female customers follows a normal distribution with
unknown mean and standard deviation of $125. For the male customers, the spending is
assumed to follow a normal distribution with unknown mean and standard deviation of $110.
A random sample of 23 females has a mean spending of $375 and another independent sample of
25 males has a mean spending of $362. Is there any evidence that the mean spending made by
female customers is higher than that made by male customers at the 0.05 level of significance?

Solution
Define X as the spending of each customer (quantitative variable). The test is about whether the
population mean for females (μF) is higher than the population mean for males (μM).

Step 1. H0: μF − μM = 0 v.s. H1: μF − μM > 0

Step 2. Reject H0 if z > 1.645 (z0.05 = 1.645)


Step 3. z = (375 − 362) / √(125²/23 + 110²/25) = 0.3811

Step 4. As 0.3811 < 1.645, H0 is not rejected.

Conclusion: There is no evidence to say that on the average the female customers spend more
than male customers.
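
Remark (reference reading): a minimal Python sketch of the same two-sample z-test with known standard deviations, assuming scipy is available.

    from scipy.stats import norm

    xbar_f, sigma_f, n_f = 375, 125, 23     # female sample
    xbar_m, sigma_m, n_m = 362, 110, 25     # male sample

    z = (xbar_f - xbar_m) / (sigma_f**2 / n_f + sigma_m**2 / n_m) ** 0.5  # about 0.3811
    p_value = 1 - norm.cdf(z)               # one-tailed p-value, about 0.3515
    reject_h0 = p_value < 0.05              # False, so H0 is not rejected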

94
Applied Statistics

Reading the Excel output

The calculation of the test statistics becomes complicated when it involves two samples. Below
is the report generated by Excel with the test conducted at 0.05 level of significance. Let’s read
it and generate a hypothesis testing report and derive the conclusion from it.

z-Test: Two Sample for Means

Female Male (Explanation)


Mean 375 362 (sample mean)
Known Variance 15625 12100 (population variance)
Observations 23 25 (sample size)
Hypothesized Mean Difference 0 (null hypothesis)
z 0.3811 (z-statistics)
P(Z<=z) one-tail 0.3515 (p-value for one-tailed test)
z Critical one-tail 1.6449 (critical value for one-tailed test)
P(Z<=z) two-tail 0.7031 (p-value for two-tailed test)
z Critical two-tail 1.9600 (critical value for two-tailed test)

The first few lines of the report are straightforward and give a summary of the two datasets.
Be aware that the Excel output presents results for both the one-tailed test and the two-tailed test.
You need to pick the appropriate set of results according to the alternative hypothesis to generate
the report.

As the alternative hypothesis in our example is one-tailed (right-tailed), the following report
can be derived from the Excel output:

Report in critical value approach


Define X as the spending of each customer (quantitative variable). The test is about whether the
population mean for females (μF) is higher than the population mean for males (μM).

Step 1. H0: μF − μM = 0 v.s. H1: μF − μM > 0

Step 2. Reject H0 if z > 1.6449 (Critical value for one-tailed test, right tailed)

Step 3. z statistics is calculated as 0.3811

Step 4. As 0.3811 < 1.645, H0 is not rejected.

Conclusion: There is no evidence to say that on the average the female customers spend more
than male customers.

Remark: z = (375 − 362) / √(15625/23 + 12100/25) = 0.3811

95
Applied Statistics

Report in p-value approach


Define X as the spending of each customer (quantitative variable). The test is about whether the
population mean for females (μF) is higher than the population mean for males (μM).

Step 1. H0: μF − μM = 0 v.s. H1: μF − μM > 0

Step 2. z statistics is calculated as 0.3811

Step 3. p-value = 0.3515

Step 4. As p-value = 0.3515 > 0.05, the null hypothesis is not rejected.

Conclusion: There is no evidence to say that on the average the female customers spend more
than male customers.

96
Applied Statistics

VI. t-Test: Hypothesis for the Difference between Means (Independent)

We want to do a similar test as in section (V). However, this time the population variances
of the two independent populations are unknown. In this case, we need to estimate the variances
by the sample variances. Instead of estimating the population variances separately, we need to

a. confirm / assume the two independent populations are normally distributed, and
b. assume the population variances are unknown but equal. Then the pooled variance is estimated by

   sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / [(n1 − 1) + (n2 − 1)]

s1² = Σ(x − x̄)² / (n1 − 1),   s2² = Σ(y − ȳ)² / (n2 − 1)

sp² = [Σ(x − x̄)² + Σ(y − ȳ)²] / [(n1 − 1) + (n2 − 1)]
    = [(n1 − 1)s1² + (n2 − 1)s2²] / [(n1 − 1) + (n2 − 1)]

(Do you remember the calculation of a weighted average?)
Step 1: The null hypothesis and alternative hypothesis are set up as:

a. H0: 1 −  2 = 0 v.s. H1: 1 −  2  0 (Two-tailed test)


b. H0: 1 −  2 = 0 v.s. H1: 1 −  2  0 (One-tailed test)
c. H0: 1 −  2 = 0 v.s. H1: 1 −  2  0 (One-tailed test)

Step 2: Set up rejection region(s) from t-distribution with degree of freedom = n1 + n2 – 2 as:

Test                                                   Rejection Region

a. H0: μ1 − μ2 = 0 v.s. H1: μ1 − μ2 ≠ 0                either t is too large (t > tα/2) or t is too small (t < -tα/2).
b. H0: μ1 − μ2 = 0 v.s. H1: μ1 − μ2 > 0                t is too large (t > tα).
c. H0: μ1 − μ2 = 0 v.s. H1: μ1 − μ2 < 0                t is too small (t < -tα).

Step 3: Calculate the t-statistic as  t = (x̄1 − x̄2) / √(sp²(1/n1 + 1/n2))

The calculation is similar to the test statistic in the previous test, with the population
variances replaced by the pooled variance.

Step 4: When the t statistics falls into the rejection region, the null hypothesis is rejected,
otherwise the null hypothesis is not rejected.

97
Applied Statistics

Example 6

The marketing team wants to know if displaying the product in different positions in the
supermarket would have a significant effect on the sales performance. The sales performance
of Billy cola is used to conduct a test. The following are the weekly sales of Billy cola collected
from different supermarkets with the colas displayed in (a) a normal shelf and (b) a promotion area.
Test at the 0.01 level of significance if the average weekly sales of Billy cola when displayed in the
promotion area is higher than when it is displayed on the normal shelf. Assume the sales from both
display types follow normal distributions with equal variance.

(a) Weekly sales when displayed in normal shelf:      32  34  52  62  30  40  64  52  56  59

(b) Weekly sales when displayed in promotion area:    52  71  76  54  67  83  66  90  77  84

Below is the Excel output with the test conducted at 0.01 level of significance:

t-Test: Two-Sample Assuming Equal Variances

Normal Promotion
shelf area
Mean 48.1 72
Variance 167.6556 157.3333
Observations 10 10
Pooled Variance 162.4944
Hypothesized Mean Difference 0
df 18
t Stat -4.1924
P(T<=t) one-tail 0.0003
t Critical one-tail 2.5524
P(T<=t) two-tail 0.0005
t Critical two-tail 2.8784

98
Applied Statistics

Report in critical value approach


Define X as the weekly sales of Billy cola (quantitative). The test is about whether the population mean
weekly sales from supermarkets with items displayed in the promotion area (μPro) is higher than the
population mean weekly sales from supermarkets with items displayed in the normal shelf (μN).

Step 1. H0: μN − μPro = 0 v.s. H1: μN − μPro < 0

Step 2. Reject H0 when t < -2.552 (Critical value for one-tailed test, left-tailed)

Step 3. t statistics is calculated as -4.1924

Step 4. As -4.1924 < -2.552, the null hypothesis is rejected.

Conclusion: There is sufficient evidence indicating that the average weekly sales of Billy cola
displayed at the promotion area is higher than that being displayed in the normal shelf.

Remark: t = (48.1 − 72) / √(162.4944 (1/10 + 1/10)) = −4.1924
Report in p-value approach
Define X as the weekly sales of Billy cola (quantitative). The test is about whether the population mean
weekly sales from supermarkets with items displayed in the promotion area (μPro) is higher than the
population mean weekly sales from supermarkets with items displayed in the normal shelf (μN).

Step 1. H0: μN − μPro = 0 v.s. H1: μN − μPro < 0

Step 2. t statistics is calculated as -4.1924

Step 3. p-value = 0.0003

Step 4. As 0.0003 < 0.01, the null hypothesis is rejected at 1% level of significance.

Conclusion: There is sufficient evidence indicating that the average weekly sales of Billy cola
displayed at the promotion area is higher than that being displayed in the normal shelf.
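
Remark (reference reading): a minimal Python sketch reproducing the Excel pooled-variance t-test above, assuming scipy is available. Halving the two-tailed p-value is valid here because the t statistic lies in the direction of the alternative hypothesis.

    from scipy.stats import ttest_ind

    normal_shelf   = [32, 34, 52, 62, 30, 40, 64, 52, 56, 59]
    promotion_area = [52, 71, 76, 54, 67, 83, 66, 90, 77, 84]

    t_stat, p_two_tail = ttest_ind(normal_shelf, promotion_area, equal_var=True)
    # t_stat is about -4.1924; the one-tailed p-value is about 0.0003
    reject_h0 = (p_two_tail / 2) < 0.01     # True, so H0 is rejected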

99
Applied Statistics

VII. z-Test: Hypothesis for the Difference between Proportions

Here, we would like to compare the population proportions from two independent populations.
In marketing studies, we often want to test whether males and females respond similarly to a
particular product. In the coming example, we need to conclude whether the proportion of
males who prefer Chinese tea to Japanese tea is similar to the proportion of females who prefer
Chinese tea to Japanese tea. When testing the null hypothesis that the two independent population
proportions are the same, H0: p1 = p2, we need to use the sample data to generate three
proportions: p̂1 as the sample proportion for population 1, p̂2 as the sample proportion for population 2,
and a pooled sample proportion combining all data together, p̂ = (n1 p̂1 + n2 p̂2) / (n1 + n2).
Step 1: Set up null hypothesis and alternative hypothesis as follow:

a. H0: p1 − p2 = 0 v.s. H1: p1 − p2 ≠ 0 (Two-tailed test)

b. H0: p1 − p2 = 0 v.s. H1: p1 − p2 > 0 (One-tailed test)
c. H0: p1 − p2 = 0 v.s. H1: p1 − p2 < 0 (One-tailed test)

Step 2: Set up rejection region(s) from standard normal distribution as:

Test                                                   Rejection Region

a. H0: p1 − p2 = 0 v.s. H1: p1 − p2 ≠ 0                either z is too large (z > zα/2) or z is too small (z < -zα/2)
b. H0: p1 − p2 = 0 v.s. H1: p1 − p2 > 0                z is too large (z > zα)
c. H0: p1 − p2 = 0 v.s. H1: p1 − p2 < 0                z is too small (z < -zα)

Step 3: Calculate the z-statistic as  z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2))

The two independent sample proportions result from the following two normal
distributions:

p̂1 ~ N(p1, p1(1 − p1)/n1)   and   p̂2 ~ N(p2, p2(1 − p2)/n2)

When many possible pairwise comparisons between a sample from population 1 and
a sample from population 2 are made, p̂1 − p̂2 ~ N(p1 − p2, p1(1 − p1)/n1 + p2(1 − p2)/n2).
When the null hypothesis is true (p1 = p2 = p), it becomes

p̂1 − p̂2 ~ N(0, p(1 − p)/n1 + p(1 − p)/n2)

The above z-statistic should follow the standard normal distribution.

Step 4: When the z-score falls into the rejection region, the null hypothesis is rejected,
otherwise the null hypothesis is not rejected.

100
Applied Statistics

Example 7

A “tea-lover” group wants to know what kind of tea is preferred by males and females.
The marketing team has invited 200 males and 180 females to try two types of tea, Chinese tea and
Japanese tea. The survey result indicates that 70% of males in the sample prefer Chinese
tea and 65% of females in the sample prefer Chinese tea. Test, at the 0.05 level of significance,
whether the proportion of males who prefer Chinese tea and the proportion of females who prefer
Chinese tea in the population are significantly different.

Below is the Excel output with the test conducted at 0.05 level of significance:

z-Test: Two Sample for Proportions

Male Female
Proportion 0.7 0.65
Variance 0.2189 0.2189
Observations 200 180
Hypothesized Proportion Difference 0
z 1.0401
P(Z<=z) one-tail 0.1491
z Critical one-tail 1.6449
P(Z<=z) two-tail 0.2983
z Critical two-tail 1.9600

Report in critical value approach


Define p as the proportion of people who prefer Chinese tea to Japanese tea. The test is about whether
the population proportion of males who prefer Chinese tea (pM) and the population proportion of
females who prefer Chinese tea (pF) are different.

Step 1. H0: pM − pF = 0 v.s. H1: pM − pF ≠ 0

Step 2. Reject H0 when z < -1.96 or when z > 1.96 (Critical value for two-tailed test)

Step 3. z statistics is calculated as 1.0401

Step 4. As -1.96 < 1.0401 < 1.96, H0 is not rejected at 5% significance level.

Conclusion: There is no evidence to say that the proportion of male prefers Chinese tea and
proportion of female prefers Chinese tea in the population are different.

Remark: z = (0.7 − 0.65) / √(0.6763(0.3237)(1/200 + 1/180)) = 1.0401

101
Applied Statistics

Report in p-value approach


Define p as the proportion of people who prefer Chinese tea to Japanese tea. The test is about whether
the population proportion of males who prefer Chinese tea (pM) and the population proportion of
females who prefer Chinese tea (pF) are different.

Step 1. H0: pM − pF = 0 v.s. H1: pM − pF ≠ 0

Step 2. z statistics is calculated as 1.0401

Step 3. p value = 0.2983

Step 4. As 0.2983 > 0.05, H0 is not rejected at 5% significance level.

Conclusion: There is no evidence to say that the proportion of male prefers Chinese tea and
proportion of female prefers Chinese tea in the population are different.
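
Remark (reference reading): a minimal Python sketch of the same two-proportion z-test, assuming scipy is available.

    from scipy.stats import norm

    n_m, p_m = 200, 0.70                    # male sample
    n_f, p_f = 180, 0.65                    # female sample

    p_pool = (n_m * p_m + n_f * p_f) / (n_m + n_f)            # about 0.6763
    se = (p_pool * (1 - p_pool) * (1 / n_m + 1 / n_f)) ** 0.5
    z = (p_m - p_f) / se                                      # about 1.0401
    p_value = 2 * (1 - norm.cdf(abs(z)))                      # about 0.2983
    reject_h0 = p_value < 0.05                                # False, so H0 is not rejected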

102
Applied Statistics

Appendix (Reference Reading)

Running t-test with 2 samples by Excel


Here is a brief introduction about how to run the t-test with 2 samples by Excel for your quick
reference. Below is the example in page 98.

1. Check that the Add-ins function “Analysis ToolPak” has been installed before you start your
work.
2. Input data in the Excel worksheet.

3. From the menu bar, select Data > Data Analysis > t-Test: Two-Sample Assuming Equal
Variances.

4. Input Range for dataset 1, dataset 2, set null hypothesis value as 0, tick the box Labels if
you have included variable name in your dataset, and input alpha value which is the level of
significance of the test.

103
Applied Statistics

Useful formulae

One population test

I:   H0: μ = μ0        z = (x̄ − μ0) / (σ/√n)
II:  H0: μ = μ0        t = (x̄ − μ0) / (s/√n)
III: H0: p = p0        z = (p̂ − p0) / √(p0(1 − p0)/n)

Two dependent populations

IV:  H0: μD = 0        t = x̄D / (sD/√n)

Two independent populations

V:   H0: μ1 − μ2 = 0   z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)

VI:  H0: μ1 − μ2 = 0   t = (x̄1 − x̄2) / √(sp²(1/n1 + 1/n2))
     where sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / [(n1 − 1) + (n2 − 1)]

VII: H0: p1 − p2 = 0   z = (p̂1 − p̂2) / √(p̂(1 − p̂)(1/n1 + 1/n2))
     where p̂ = (n1 p̂1 + n2 p̂2) / (n1 + n2)

104
Applied Statistics

Chapter 9 Analysis of Variance


In the previous chapter, we discussed the general procedure of conducting hypothesis testing to
reach conclusions about the mean of one population (tests I, II), the comparison of means for two
populations (tests IV, V, VI), the proportion of one population (test III), and the comparison of
proportions for two populations (test VII). In this chapter, we will discuss the test related to the
comparison of means for more than two populations.

In many situations, you need to examine differences in the means of a quantitative variable among
many groups of individuals. For example, the policy maker may want to compare the traveling
expenses of people living in different districts before suggesting a traveling expense allowance
scheme. By grouping residents by area (Hong Kong Island, Kowloon, New
Territories), the objective is to test whether the mean traveling expenses among different groups are all
the same (null hypothesis) or the mean traveling expenses among different groups are not all the
same (alternative hypothesis). This kind of test is named one-way analysis of variance
(ANOVA).

Major learning objectives of this Chapter:


➢ Be able to perform one-way ANOVA F test

The one-way ANOVA test is particularly used for a quantitative variable with more than 2 populations
(groups). Assume that the c groups represent populations whose values are randomly and
independently selected, follow a normal distribution, and have equal variances. The test sets the null
hypothesis of no differences in the population means against the alternative hypothesis that not
all the c population means are equal:

H0: μ1 = μ2 = … = μc
H1: not all μj are equal (where j = 1, 2, …, c)

Imagine the data (traveling expense to work on 3 September 2019) collected for the test about the
traveling expense are as below:

Group 1 (Hong Kong Island) 5 8 11


Group 2 (Kowloon) 7 8 9 11
Group 3 (New Territories) 9 13 14 14

105
Applied Statistics

One-way ANOVA F test

F statistics based on the provided data should be calculated in order to justify if there is sufficient
evidence to reject the null hypothesis. In order to calculate the F statistics, you should start with
preparing the summary for the data

Group Number of data Mean


1 n1 𝑥̅ 1
2 n2 𝑥̅ 2
… … …
c nc 𝑥̅ C
Total n = n1 + n2 + … + nc x̿ (overall mean)

As in our example, the summary for the data set with c = 3,


Group Number of data Mean
1 3 8
2 4 8.75
3 4 12.5
Total 11 9.9091

Referring to the above summary table, you should be aware that the sample means for group 1, group 2,
and group 3 are not the same. Taking possible sampling error into consideration, how would we
judge whether the differences between the means are reasonably small and due to random error, or
whether the population means are not all the same?

You need to fill up the ANOVA summary table to calculate the F statistic as

Source of Variation    SS                            d.f.     MS                   F statistic
Between Group          (ii) SSA = Σ nj(x̄j − x̿)²      c − 1    MSA = SSA/(c − 1)    F = MSA/MSW
Within Group           (iii) SSW = Σ(x − x̄j)²        n − c    MSW = SSW/(n − c)
                             = SST − SSA
Total                  (i) SST = Σ(x − x̿)²           n − 1

SST: measures the total variation of the data about the overall mean
SSA: measures the total variation of the group means about the overall mean
SSW: measures the total variation of the data about the group means

106
Applied Statistics

Component MSA measures the variation of the group means about the overall mean, while MSW
measures the variation of the data about the group means. If the null hypothesis is true, the
calculated F statistic should be close to 1. When the calculated F statistic is significantly large,
it is strong evidence to reject the null hypothesis.

In our example, the ANOVA summary table is:

Source of Variation   SS                                                   d.f.          MS                       F statistic
Between Group         3(8 − 9.9091)² + 4(8.75 − 9.9091)²                   3 − 1 = 2     43.1591 / 2 = 21.5796    21.5796 / 5.4688 = 3.9459
                      + 4(12.5 − 9.9091)² = 43.1591
Within Group          86.9091 − 43.1591 = 43.75                            11 − 3 = 8    43.75 / 8 = 5.4688
Total                 (5 − 9.9091)² + (8 − 9.9091)² + … + (14 − 9.9091)²   11 − 1 = 10
                      = 86.9091

With the calculated F statistic of 3.9459, you need to compare this with the critical value found
from the F distribution with degrees of freedom (2, 8).

107
Applied Statistics

F Distribution

The F distribution is a family of distributions. It is a right-skewed distribution. All F statistics are
either 0 or positive. The F distribution has two degrees of freedom, r1 and r2.

For the ANOVA test, the degrees of freedom are defined as:

r1 = number of groups − 1, c − 1
r2 = number of data − number of groups, n − c

The next two pages show the critical values of the F distribution with degrees of freedom r1 and r2
at the 5% and 1% levels of significance. When the resulting F statistic is greater than the
corresponding critical value, there is strong evidence that the null hypothesis that all means are the
same should be rejected.

108
Applied Statistics

The entries in Table III are values for which the area to their right under the F
distribution with given degrees of freedom (the gray area in the figure) is equal
to α.

TABLE III VALUE OF F0.05

Degrees of freedom for numerator, r1


1 2 3 4 5 6 7 8 9 10
1 161 200 216 225 230 234 237 239 241 242
2 18.5 19.0 19.2 19.2 19.3 19.3 19.4 19.4 19.4 19.4
3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98
Degrees of freedom for denominator, r2

11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91
 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83

109
Applied Statistics

The entries in Table IV are values for which the area to their right under the F
distribution with given degrees of freedom (the gray area in the figure) is equal
to α.

TABLE IV VALUE OF F0.01

Degrees of freedom for numerator, r1


1 2 3 4 5 6 7 8 9 10
1 4,052 5,000 5,403 5,625 5,764 5,859 5,928 5,982 6,023 6,056
2 98.5 99.0 99.2 99.3 99.3 99.3 99.4 99.4 99.4 99.4
3 34.1 30.8 29.5 28.7 28.2 27.9 27.7 27.5 27.4 27.2
4 21.2 18.0 16.7 16.0 15.5 15.2 15.0 14.8 14.7 14.6
5 16.3 13.3 12.1 11.4 11.0 10.7 10.5 10.3 10.2 10.1
6 13.7 10.9 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87
7 12.2 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62
8 11.3 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81
9 10.6 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26
10 10.0 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85
Degrees of freedom for denominator, r2

11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10
14 8.86 6.52 5.56 5.04 4.70 4.46 4.28 4.14 4.03 3.94
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.90 3.81
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69
17 8.40 6.11 5.19 4.67 4.34 4.10 3.93 3.79 3.68 3.59
18 8.29 6.01 5.09 4.58 4.25 4.02 3.84 3.71 3.60 3.51
19 8.19 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26
23 7.88 5.66 4.77 4.26 3.94 3.71 3.54 3.41 3.30 3.21
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17
25 7.77 5.57 4.68 4.18 3.86 3.63 3.46 3.32 3.22 3.13
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47
 6.64 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32

110
Applied Statistics

Procedure for one-way ANOVA test

In order to justify whether the means of c populations are the same, the one-way ANOVA test should be
conducted as follows:

Step 1. Define the null hypothesis and the alternative hypothesis


H0: μ1 = μ2 = … = μc
H1: not all μj are equal (where j = 1, 2, …, c)

Step 2. Set up rejection region as

F > Fα, r1, r2
where r1 is the degrees of freedom (c − 1),
r2 is the degrees of freedom (n − c)
(It is always a one-tailed test)

Step 3. Compile the test statistic

F = [Σ nj(x̄j − x̿)² / (c − 1)] / [Σ(x − x̄j)² / (n − c)]

Step 4. When the F statistics falls into the rejection region, the null hypothesis is rejected,
otherwise the null hypothesis is not rejected.

111
Applied Statistics

Example
The Social Welfare Department is doing research about traveling expenses. Traveling
expenses to work on 3 September 2019 are collected for samples of individuals living in Hong
Kong Island, Kowloon, and New Territories. Test, at the 5% significance level, whether the mean
traveling expenses of residents living in Hong Kong Island, Kowloon, and New Territories are the same.

Group 1 (Hong Kong Island) 5 8 11


Group 2 (Kowloon) 7 8 9 11
Group 3 (New Territories) 9 13 14 14

Solution
Step 1: H0: µHK Island = µKowloon = µNT
H1: not all μ are equal

Step 2: Reject H0 when F > 4.46 (F 0.05, 2, 8)

Step 3:
Group Number of data Mean
1 3 8
2 4 8.75
3 4 12.5
Total 11 9.9091

Source of Variation   SS                                                   d.f.          MS                       F statistic
Between Group         3(8 − 9.9091)² + 4(8.75 − 9.9091)²                   3 − 1 = 2     43.1591 / 2 = 21.5796    21.5796 / 5.4688 = 3.9459
                      + 4(12.5 − 9.9091)² = 43.1591
Within Group          86.9091 − 43.1591 = 43.75                            11 − 3 = 8    43.75 / 8 = 5.4688
Total                 (5 − 9.9091)² + (8 − 9.9091)² + … + (14 − 9.9091)²   11 − 1 = 10
                      = 86.9091

Step 4: As F = 3.9459 < 4.46, H0 is not rejected.

Conclusion: The average traveling expense for residents living in Hong Kong Island, Kowloon,
and New Territories is concluded to be the same.

112
Applied Statistics

Reading the Excel output

When the data set gets large, the calculation of the F statistic becomes challenging. Below
is the output report from running the one-way ANOVA in Excel at the 0.05 level of significance:

Anova: Single Factor

SUMMARY
Groups Count Sum Average Variance
HK Island 3 24 8 9
Kowloon 4 35 8.75 2.916667
NT 4 50 12.5 5.666667

ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 43.15909 2 21.57955 3.945974 0.064217 4.45897
Within Groups 43.75 8 5.46875

Total 86.90909 10

Referring to the above report, you can report the result of the test in critical value approach
as:

Step 1: H0: µHK Island = µKowloon = µNT


H1: not all μ are equal

Step 2: Reject H0 when F > 4.4590

Step 3: F statistics is calculated as 3.9460

Step 4: As F = 3.9460 < 4.4590, H0 is not rejected at 0.05 level of significance

Conclusion: The average traveling expense for residents living in Hong Kong Island, Kowloon,
and New Territories is concluded to be the same.

You can report the result of the test in p-value approach as:

Step 1: H0: µHK Island = µKowloon = µNT


H1: not all μ are equal

Step 2: F statistics is calculated as 3.9460

Step 3: p-value = 0.0642

Step 4: As p-value = 0.0642 > 0.05, H0 is not rejected at 0.05 level of significance

Conclusion: The average traveling expense for residents living in Hong Kong Island, Kowloon,
and New Territories is concluded to be the same.
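
Remark (reference reading): a minimal Python sketch reproducing the one-way ANOVA above, assuming scipy is available.

    from scipy.stats import f_oneway, f

    hk_island = [5, 8, 11]
    kowloon   = [7, 8, 9, 11]
    nt        = [9, 13, 14, 14]

    f_stat, p_value = f_oneway(hk_island, kowloon, nt)   # about 3.946 and 0.0642
    f_crit = f.ppf(0.95, dfn=2, dfd=8)                   # about 4.459
    reject_h0 = f_stat > f_crit                          # False, so H0 is not rejected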

113
Applied Statistics

Chapter 10 Chi Square Test


In this chapter, we will continue the discussion of hypothesis testing applied to qualitative data.
The chi-square test, also named the goodness of fit test, will be used to test if the observed data fit
well to the proposed probability function.

Major learning objectives of this Chapter:


➢ Be able to perform Chi-square test

Goodness of Fit Test


The goodness of fit test is a statistical test which validates the null hypothesis that the observed data
follow a particular probability function. Based on the assumed probability distribution
function, the expected frequency for each category is calculated. This expected frequency is
compared with the observed frequency. A large difference between the observed and expected
frequencies indicates a significant violation of the null hypothesis.

Imagine a fair die is tossed 120 times. As the die is fair, we expect that among the 120 tosses, each
face 1, 2, 3, 4, 5, 6 would be observed 20 times. Practically, it is not surprising if
some deviation from the expected frequencies is observed (for example, number 1 is observed
22 times instead of 20 times). Consider the following chi-square statistic:

χ² = Σi (Oi − Ei)² / Ei

where Ei is the expected frequency in group i, and


Oi is the observed frequency in group i

Number 1 2 3 4 5 6
Expected Frequency 20 20 20 20 20 20
Observed Frequency

A long series of simulations by computer program suggests that if the experiment is conducted
repeatedly (for example, 100000 times) and each time the corresponding chi-square statistic is
compiled, the chi-square statistics should follow a chi-square distribution with degrees of freedom
k − 1, where k is the number of subgroups of this variable. In our case, k = 6, so the degrees of
freedom is 5.

114
Applied Statistics

Chi-Square Distribution

The chi-square distribution is a right-skewed distribution. All chi-square
statistics take a value of 0 or positive because of the squared
differences. The degrees of freedom is defined as the number of
subgroups minus 1 (k − 1).

Assume we do the tossing experiment repeatedly. Referring
to the table on the next page, 95% of the cases should have a
chi-square statistic between 0 and 11.070, while 99% of the
cases should have a chi-square statistic between 0 and 15.086.

Considering the observed frequencies as the result of one of many possible samples, we can test
the hypothesis of whether the data fit a particular probability distribution function by the following procedure:

Step 1. Define the null hypothesis and the alternative hypothesis


H0: data distribute according to the probability distribution function
v.s. H1: data distribute different from the probability distribution function

Step 2. Set up rejection region as

χ² > χ²α, d
where d is the degrees of freedom (number of subgroups − 1)
(It is always a one-tailed test)

Step 3. Compile the test statistic

χ² = Σi (Oi − Ei)² / Ei

Step 4. When the chi square statistics falls into the rejection region, the null hypothesis is
rejected, otherwise the null hypothesis is not rejected.

115
Applied Statistics

The entries in Table V are values for which the area to their right under the chi-square distribution
with given degrees of freedom (the gray area in the figure) is equal to α.

TABLE V VALUES OF χ²α

d.f.       χ²0.05       χ²0.01       d.f.
1 3.841 6.635 1
2 5.991 9.210 2
3 7.815 11.345 3
4 9.488 13.277 4
5 11.070 15.086 5
6 12.592 16.812 6
7 14.067 18.475 7
8 15.507 20.090 8
9 16.919 21.666 9
10 18.307 23.209 10
11 19.675 24.725 11
12 21.026 26.217 12
13 22.362 27.688 13
14 23.685 29.141 14
15 24.996 30.578 15
16 26.296 32.000 16
17 27.587 33.409 17
18 28.869 34.805 18
19 30.144 36.191 19
20 31.410 37.566 20
21 32.671 38.932 21
22 33.924 40.289 22
23 35.172 41.638 23
24 36.415 42.980 24
25 37.652 44.314 25
26 38.885 45.642 26
27 40.113 46.963 27
28 41.337 48.278 28
29 42.557 49.588 29
30 43.773 50.892 30

116
Applied Statistics

Example
An ordinary die is thrown 120 times and each time the number on the uppermost face is noted.
The results are as follows:

Number 1 2 3 4 5 6 Total
Observed Frequency 14 16 24 22 24 20 120

Test, at the 5% level, whether the die is fair.

Solution
Step 1. H0 : “1” : “2” : “3” : “4” : “5” : “6” = 1 : 1 : 1 : 1 : 1 : 1
H1 : “1” : “2” : “3” : “4” : “5” : “6” ≠ 1 : 1 : 1 : 1 : 1 : 1

Step 2. Reject H0 if χ² > 11.070 (χ²0.05, 5 = 11.070)

Step 3. Outcome 1 2 3 4 5 6
Observed frequency 14 16 24 22 24 20
Expected frequency 20 20 20 20 20 20

χ² = (14−20)²/20 + (16−20)²/20 + (24−20)²/20 + (22−20)²/20 + (24−20)²/20 + (20−20)²/20 = 4.4

Step 4. As χ2 = 4.4 < 11.070, H0 is not rejected at the 5% level of significance.

Conclusion: The die is concluded as fair.
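
Remark (reference reading): a minimal Python sketch of the same goodness of fit test, assuming scipy is available.

    from scipy.stats import chisquare, chi2

    observed = [14, 16, 24, 22, 24, 20]
    expected = [20] * 6                     # a fair die: 120 tosses / 6 faces

    chi2_stat, p_value = chisquare(observed, f_exp=expected)  # about 4.4 and 0.49
    chi2_crit = chi2.ppf(0.95, df=5)                          # 11.070
    reject_h0 = chi2_stat > chi2_crit                         # False, so H0 is not rejected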

117
Applied Statistics

Chapter 11 Correlation and Regression


In this chapter, the relationship between two continuous numerical variables will be studied.
The coefficient of correlation is compiled to measure the strength of association / relationship, while
the simple linear regression model is used to predict the values of the dependent variable based on the
values of one independent variable.

Major learning objectives of this Chapter:


➢ Be able to present relationship between two continuous numerical variables by the
calculation of the coefficient of correlation and interpret it
➢ Be able to fit the simple linear regression model between two variables and interpret the
meaning of it
➢ Be able to predict the value of the dependent variable given the value of the independent
variable based on the fitted regression model and comment on its reliability

Example 1
When you want to rent a flat, do you think there is any relationship between the size of the flat
and the monthly rental cost? Is a bigger flat worth a higher monthly rental cost? Is it
possible to predict the monthly rental cost by knowing the size of the flat? In order to review
the relationship between the size of the flat and the monthly rental cost, below is the
information collected from a property agency for a sample of 10 flats:

Apartment X Y
Size (square feet) Monthly Rent ($)
1 700 8200
2 650 7500
3 690 7900
4 500 6700
5 820 10500
6 730 7900
7 740 7500
8 680 6800
9 540 6300
10 670 7000

118
Applied Statistics

Scatter Diagram

A scatter diagram is used to review the relationship between two variables by plotting a sample
of (x, y) data in an x-y plane. The nature of the relationship between two variables can take many
forms. The simplest relationship consists of a straight line, which is called a linear
relationship.

[Figures: four example scatter plots of Y against X showing, in turn, strong positive correlation, moderate positive correlation, moderate negative correlation, and strong negative correlation]

119
Applied Statistics

Example 1

This is the scatter plot between the size of the flat (X) and the monthly rental cost (Y). Would
you say the correlation is positive or negative?

[Figure: scatter plot of monthly rent (Y, $0–12000) against flat size (X, 0–900 square feet)]

120
Applied Statistics

Coefficient of Correlation

While the scatter plot is very useful for us to visualize the relationship, the strength of the
relationship cannot be read out precisely. The coefficient of correlation, r, is a measure of the
strength of a linear relationship between two variables. The measurement r ranges from -1 to
1, where -1 indicates a perfect negative linear relationship and 1 indicates a perfect positive linear
relationship. The coefficient of correlation, r, is defined as

n xy −  x y
r=
 n x 2 − ( x )2   n y 2 − ( y )2 
       

In general, when |r| < 0.3 the relationship is pretty weak. When |r| is around 0.5 the relationship
is moderate. While |r|> 0.7 indicates a strong relationship.

Example 1
Consider the following information about the monthly rental cost (Y) and the size of the apartment
(X). Compile the correlation coefficient and comment on it.

Apartment x y x2 y2 xy
Size Monthly
(square feet) Rent ($)
1 700 8200 490000 67240000 5740000
2 650 7500 422500 56250000 4875000
3 690 7900 476100 62410000 5451000
4 500 6700 250000 44890000 3350000
5 820 10500 672400 110250000 8610000
6 730 7900 532900 62410000 5767000
7 740 7500 547600 56250000 5550000
8 680 6800 462400 46240000 4624000
9 540 6300 291600 39690000 3402000
10 670 7000 448900 49000000 4690000
Total 6720 76300 4594400 594630000 52059000

Solution
r = 0.7938 (from calculator)
Remark:
r = [10(52059000) − (6720)(76300)] / √{[10(4594400) − 6720²][10(594630000) − 76300²]} = 0.7938

➢ This indicates a strong positive correlation between the size of the apartment and the
monthly rental cost.
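
Remark (reference reading): a minimal Python sketch of the same calculation, assuming numpy is available; np.corrcoef(size, rent)[0, 1] gives the same value directly.

    import numpy as np

    size = np.array([700, 650, 690, 500, 820, 730, 740, 680, 540, 670])
    rent = np.array([8200, 7500, 7900, 6700, 10500, 7900, 7500, 6800, 6300, 7000])

    n = len(size)
    num = n * (size * rent).sum() - size.sum() * rent.sum()
    den = np.sqrt((n * (size ** 2).sum() - size.sum() ** 2) *
                  (n * (rent ** 2).sum() - rent.sum() ** 2))
    r = num / den                           # about 0.7938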

121
Applied Statistics

Spurious correlation (Reference Reading)

A spurious correlation is a relationship between two variables that appear to have


interdependence or association with each other but actually do not. Spurious correlation is often
caused by a third factor that is not apparent at the time of examination. For example, when data
of X: cost of electricity and Y: personal education fee is collected over time, correlation is found.
However, the correlation is explained by a confounding factor: inflation, which makes both
electricity and education costs grow over time.

The main tool in diagnosing whether a correlation is spurious or not is to examine the quality of
the theory behind it. In the case of tobacco and lung cancer, only a clear explanation for the
biological mechanism that caused smoking to lead to lung cancer settled the debate.

122
Applied Statistics

Simple Linear Regression Model

While the correlation measures the strength of the relationship, it does not indicate a cause-and-effect
relationship between the two variables. When one variable (the dependent variable, Y) is
assumed to depend on the other variable (the independent variable, X) linearly, the simple linear
regression model can be used to outline the relationship between them.

The simple linear regression model can be represented by an equation:


Yi = a + bXi + εi

where Y is the dependent variable,


X is the independent variable,
a is the Y intercept,
b is the slope of the linear regression, and
εi is the random error in Y for observation i.

It is reasonable to assume the monthly rental cost depends on the size of the flat, while it sounds
a bit strange to say that the size of the flat depends on the monthly rental cost. So the monthly
rental cost is the dependent variable (Y), whose value depends on the independent variable (X),
the size of the flat. Putting it in a linear model, Y = a + bX. However, what values of a and b are
most suitable to explain the relationship for this set of (X, Y)?

[Figure: scatter plot of monthly rent (Y) against flat size (X) with a straight line Y = a + bX drawn through the points]

123
Applied Statistics

Writing the random error as εi = yi − a − bxi, the idea of finding the best linear
regression line for the data is to find the pair of a and b such that the accumulated squared error
is the smallest. The best straight line (least squares regression line) that represents the points is:

y = a + bx,

where b = [nΣxy − (Σx)(Σy)] / [nΣx² − (Σx)²],  and  a = Σy/n − b(Σx/n)

Example 1
Assume a linear relationship is found between the monthly rental cost (Y) and the size of the
apartment (X). Fit the regression line Y = a + bX and interpret the values of a and b.

Apartment x y x2 xy
Size Monthly
(square feet) Rent ($)
1 700 8200 490000 5740000
2 650 7500 422500 4875000
3 690 7900 476100 5451000
4 500 6700 250000 3350000
5 820 10500 672400 8610000
6 730 7900 532900 5767000
7 740 7500 547600 5550000
8 680 6800 462400 4624000
9 540 6300 291600 3402000
10 670 7000 448900 4690000
Total 6720 76300 4594400 52059000

124
Applied Statistics

[Figure: scatter plot of monthly rent against flat size with the fitted regression line Y = 911.7108 + 9.9975X]

Solution
a = 911.7108, b = 9.9975 (from calculator)

Remark:
b = [10(52059000) − (6720)(76300)] / [10(4594400) − (6720)²] = 9.9975

a = 76300/10 − 9.9975 (6720/10) = 911.68

➢ When the size of the flat is 0, the rental cost is $911.71.

➢ For every extra square foot increase in size, the rental cost increases by $9.9975.

Remark
Interpreting the value of a alone does not make sense (refer to extrapolation estimation on the
next page).

125
Applied Statistics

Estimation and Reliability

One important application of the regression model is to predict / estimate the value of the dependent
variable y for a given x. The value of y for a given x is estimated as:

ŷ = a + bx

Whether the estimation is reliable or not depends on i) the value of x and ii) the correlation between
x and y.

When the value of x lies within the range of (Minimum X, Maximum X) in the dataset, this
estimation is called interpolation. For interpolation estimation with strong correlation, the
estimation is reliable; for interpolation estimation with moderate correlation, the reliability of the
estimation is questionable; for interpolation estimation with weak correlation, the estimation is
unreliable.

When the value of x is outside the range of (Minimum X, Maximum X) in the dataset, the
estimation is called extrapolation. The extrapolation estimation is always unreliable as we
cannot guarantee the linear relationship is still valid outside the range of the existing dataset. So
extrapolation estimation should be avoided.

Example 1
What are the estimated monthly rental costs for (a) an 800 square foot flat and (b) a 2000 square
foot flat? Comment on their reliability with reasons.

Solution
(a) 𝑦̂ = 911.7108 + 9.9975(800) = 8909.71 ($)
➢ When a flat is 800 square feet, the estimated monthly rental cost is $8909.71. The
estimation is reliable as it is interpolation estimation with high correlation

(b) 𝑦̂ = 911.7108 + 9.9975(2000) = 20906.71 ($)


➢ When a flat is 2000 square feet, the estimated monthly rental cost is $20906.71. The
estimation is unreliable as it is extrapolation estimation.
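
Remark (reference reading): a minimal Python sketch of the regression fit and the two predictions above, assuming scipy is available.

    from scipy.stats import linregress

    size = [700, 650, 690, 500, 820, 730, 740, 680, 540, 670]
    rent = [8200, 7500, 7900, 6700, 10500, 7900, 7500, 6800, 6300, 7000]

    fit = linregress(size, rent)            # least squares line y = a + b x
    a, b = fit.intercept, fit.slope         # about 911.71 and 9.9975

    y_800  = a + b * 800                    # about 8909.7, interpolation (reliable)
    y_2000 = a + b * 2000                   # about 20906.7, extrapolation (unreliable)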

126
Applied Statistics

Rank Correlation

The rank correlation, rs, is a measure of the relationship between two variables when the ranks,
instead of the actual values, of the two variables are used.

The measurement rs is compiled as

rs = 1 − 6Σd² / [n(n² − 1)]

where d is the difference of the ranks between each x and y.

Example 2
A kid is invited to do a blind taste test of 6 ice-creams. After tasting all the ice-creams, he
arranges them in ascending order according to how much he likes them. Below is the
information about the price and the kid's rating of the ice-creams.

Ice-cream Price ($) Kid’s rating


A 8.5 4
B 9.0 2
C 12.5 1
D 14.0 6
E 25.5 5
F 30.0 3

Calculate the rank correlation between the price and the kid’s rating. Comment on it.

Solution

Ice-cream Price ($) Rank Rank d d2


Order in Order by
Price the kid
A 8.5 1 4 -3 9
B 9.0 2 2 0 0
C 12.5 3 1 2 4
D 14.0 4 6 -2 4
E 25.5 5 5 0 0
F 30.0 6 3 3 9
Total -- -- -- 0 26

rs = 1 − 6(26) / [6(6² − 1)] = 0.2571
There is a weak positive correlation between the price and the kid’s rating of these 6 ice-
creams.

Remarks:
1. When there is no tied data, the calculation of rank correlation can be done by calculator
(regression mode) by inputting the rank data.
2. Where there is tied data, the same rank should be assigned to the tied data by taking average
of the ranks and a correction factor should be applied in the calculation of rs.
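
Remark (reference reading): a minimal Python sketch of the rank correlation, assuming scipy is available; scipy.stats.spearmanr handles tied ranks automatically.

    from scipy.stats import spearmanr

    price  = [8.5, 9.0, 12.5, 14.0, 25.5, 30.0]
    rating = [4, 2, 1, 6, 5, 3]             # the kid's rank order for each ice-cream

    rho, p_value = spearmanr(price, rating) # rho is about 0.2571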
127
Applied Statistics

Calculator usage on Regression (fx-50FH / fx-50FH II)

1. Calculator Mode: REG


MODE MODE REG Lin

2. Clear Previous Data


SHIFT CLR Stat EXE

Data Set:
x 5.2 7.3 8.8 10.2 13.1 14.4 15.2 16.6 18.3 19.7 20.3 20.5
y 1.6 2.2 1.4 1.9 2.4 2.6 2.3 2.7 2.8 2.6 2.9 3.1

3. Input Data
5.2 , 1.6 DT
7.3 , 2.2 DT
: :
20.3 , 2.9 DT
20.5 , 3.1 DT

4. Essential statistics
n: SHIFT 1 3 EXE = 12
x : SHIFT 1 2 EXE = 169.6

x
2
: SHIFT 1 1 EXE = 2702.7

y : SHIFT 1 2 EXE = 28.5

y
2
: SHIFT 1 1 EXE = 70.69

 xy : SHIFT 1 3 EXE = 429.62

5. Calculating Regression Data


Regression line
a : SHIFT 2   1 EXE = 1.1350
b : SHIFT 2   2 EXE = 0.0877
Coefficient of correlation
r : SHIFT 2   3 EXE = 0.8853

a : SHIFT 2 VAR   1 EXE = 1.1350


b : SHIFT 2 VAR   2 EXE = 0.0877
r : SHIFT 2 VAR   3 EXE = 0.8853

128
Applied Statistics

Tables to be provided in the examination

The entries in Table I are the probabilities that a random variable having the
standard normal distribution will take on a value between 0 and z. They are given
by the area of the gray region under the curve in the figure.

TABLE I NORMAL-CURVE AREAS

z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4648 0.4656 0.4664 0.4671 0.4678 0.4685 0.4692 0.4699 0.4706
1.9 0.4713 0.4719 0.4725 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
Also, for z = 4.0, 5.0 and 6.0, the areas are 0.49997, 0.4999997, and 0.499999999.
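
Assuming statistical software such as SciPy is available (it is not provided in the examination), the entries of Table I can be reproduced as Φ(z) − 0.5; a brief sketch for checking:

```python
# Sketch: area under the standard normal curve between 0 and z (Table I entries).
from scipy.stats import norm

for z in (0.5, 1.0, 1.96):
    print(z, f"{norm.cdf(z) - 0.5:.4f}")   # 0.1915, 0.3413, 0.4750
```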


The entries in Table II are values for which the area to their right under the t
distribution with given degrees of freedom (the gray area in the figure) is equal
to α.
TABLE II VALUE OF t

d.f. t0.050 t0.025 t0.010 t0.005 d.f.


1 6.314 12.706 31.821 63.657 1
2 2.920 4.303 6.965 9.925 2
3 2.353 3.182 4.541 5.841 3
4 2.132 2.776 3.747 4.604 4
5 2.015 2.571 3.365 4.032 5

6 1.943 2.447 3.143 3.707 6


7 1.895 2.365 2.998 3.499 7
8 1.860 2.306 2.896 3.355 8
9 1.833 2.262 2.821 3.250 9
10 1.812 2.228 2.764 3.169 10

11 1.796 2.201 2.718 3.106 11


12 1.782 2.179 2.681 3.055 12
13 1.771 2.160 2.650 3.012 13
14 1.761 2.145 2.624 2.977 14
15 1.753 2.131 2.602 2.947 15

16 1.746 2.120 2.583 2.921 16


17 1.740 2.110 2.567 2.898 17
18 1.734 2.101 2.552 2.878 18
19 1.729 2.093 2.539 2.861 19
20 1.725 2.086 2.528 2.845 20

21 1.721 2.080 2.518 2.831 21


22 1.717 2.074 2.508 2.819 22
23 1.714 2.069 2.500 2.807 23
24 1.711 2.064 2.492 2.797 24
25 1.708 2.060 2.485 2.787 25

26 1.706 2.056 2.479 2.779 26


27 1.703 2.052 2.473 2.771 27
28 1.701 2.048 2.467 2.763 28
29 1.699 2.045 2.462 2.756 29
Inf. 1.645 1.960 2.326 2.576 Inf.
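
Likewise, the entries of Table II are upper-tail quantiles of the t distribution and can be checked with SciPy (sketch only, not part of the provided tables):

```python
# Sketch: critical values t_alpha with area alpha in the upper tail.
from scipy.stats import t

print(round(t.ppf(1 - 0.025, 10), 3))   # 2.228 (d.f. = 10)
print(round(t.ppf(1 - 0.005, 20), 3))   # 2.845 (d.f. = 20)
```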


The entries in Table III are values for which the area to their right under the F
distribution with given degrees of freedom (the gray area in the figure) is equal to α.

TABLE III VALUE OF F0.05

Degrees of freedom for numerator, r1 (columns); degrees of freedom for denominator, r2 (rows)

r2\r1   1     2     3     4     5     6     7     8     9     10
1 161 200 216 225 230 234 237 239 241 242
2 18.5 19.0 19.2 19.2 19.3 19.3 19.4 19.4 19.4 19.4
3 10.1 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81 8.79
4 7.71 6.94 6.59 6.39 6.26 6.16 6.09 6.04 6.00 5.96
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77 4.74
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10 4.06
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68 3.64
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39 3.35
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18 3.14
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98

11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83


The entries in Table IV are values for which the area to their right under the F
distribution with given degrees of freedom (the gray area in the figure) is equal to α.

TABLE IV VALUE OF F0.01

Degrees of freedom for numerator, r1 (columns); degrees of freedom for denominator, r2 (rows)

r2\r1   1      2      3      4      5      6      7      8      9      10
1 4,052 5,000 5,403 5,625 5,764 5,859 5,928 5,982 6,023 6,056
2 98.5 99.0 99.2 99.3 99.3 99.3 99.4 99.4 99.4 99.4
3 34.1 30.8 29.5 28.7 28.2 27.9 27.7 27.5 27.4 27.2
4 21.2 18.0 16.7 16.0 15.5 15.2 15.0 14.8 14.7 14.6
5 16.3 13.3 12.1 11.4 11.0 10.7 10.5 10.3 10.2 10.1
6 13.7 10.9 9.78 9.15 8.75 8.47 8.26 8.10 7.98 7.87
7 12.2 9.55 8.45 7.85 7.46 7.19 6.99 6.84 6.72 6.62
8 11.3 8.65 7.59 7.01 6.63 6.37 6.18 6.03 5.91 5.81
9 10.6 8.02 6.99 6.42 6.06 5.80 5.61 5.47 5.35 5.26
10 10.0 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85

11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10
14 8.86 6.52 5.56 5.04 4.70 4.46 4.28 4.14 4.03 3.94
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.90 3.81
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69
17 8.40 6.11 5.19 4.67 4.34 4.10 3.93 3.79 3.68 3.59
18 8.29 6.01 5.09 4.58 4.25 4.02 3.84 3.71 3.60 3.51
19 8.19 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26
23 7.88 5.66 4.77 4.26 3.94 3.71 3.54 3.41 3.30 3.21
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17
25 7.77 5.57 4.68 4.18 3.86 3.63 3.46 3.32 3.22 3.13
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47
∞ 6.64 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32
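
The entries of Tables III and IV are upper-tail quantiles of the F distribution; a brief SciPy sketch for checking (not provided in the examination):

```python
# Sketch: critical values of F with numerator d.f. r1 and denominator d.f. r2.
from scipy.stats import f

print(round(f.ppf(0.95, 3, 10), 2))   # 3.71 (Table III, r1 = 3, r2 = 10)
print(round(f.ppf(0.99, 3, 10), 2))   # 6.55 (Table IV, r1 = 3, r2 = 10)
```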


The entries in Table V are values for which the area to their right under the chi-square distribution with
given degrees of freedom (the gray area in the figure) is equal to α.

TABLE V VALUES OF χ²

d.f.      χ²0.05      χ²0.01      d.f.
1 3.841 6.635 1
2 5.991 9.210 2
3 7.815 11.345 3
4 9.488 13.277 4
5 11.070 15.086 5
6 12.592 16.812 6
7 14.067 18.475 7
8 15.507 20.090 8
9 16.919 21.666 9
10 18.307 23.209 10
11 19.675 24.725 11
12 21.026 26.217 12
13 22.362 27.688 13
14 23.685 29.141 14
15 24.996 30.578 15

16 26.296 32.000 16
17 27.587 33.409 17
18 28.869 34.805 18
19 30.144 36.191 19
20 31.410 37.566 20
21 32.671 38.932 21
22 33.924 40.289 22
23 35.172 41.638 23
24 36.415 42.980 24
25 37.652 44.314 25
26 38.885 45.642 26
27 40.113 46.963 27
28 41.337 48.278 28
29 42.557 49.588 29
30 43.773 50.892 30
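
Similarly, the entries of Table V are upper-tail quantiles of the chi-square distribution; a brief SciPy sketch for checking:

```python
# Sketch: critical values of chi-square with area alpha in the upper tail.
from scipy.stats import chi2

print(f"{chi2.ppf(0.95, 5):.3f}")    # 11.070 (d.f. = 5)
print(f"{chi2.ppf(0.99, 12):.3f}")   # 26.217 (d.f. = 12)
```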
