AS Lecture Note 2021
Applied Statistics
This course provides students with basic statistical knowledge. It equips them
with the essential statistical skills to identify and apply appropriate techniques to various
problems and to make informed decisions. Upon completion of the course, students
should understand basic statistical theory and be able to analyse, present and
interpret data using basic statistical methods.
Syllabus
1. Sampling Methods
2. Statistical Measures and Data Presentation
3. Probability
4. Probability distributions and expectation
5. Normal distribution
6. Sampling Distributions and Central Limit Theorem
7. Estimation
8. Hypothesis Testing
9. Analysis of Variance
10. Chi square test
11. Linear Regression
References
1. Allen G. Bluman (2013), Elementary Statistics: A Step by Step Approach, 9th Edition, McGraw Hill.
2. Mark L. Berenson, David M. Levine & Kathryn A. Szabat (2015), Basic Business Statistics:
Concepts and Applications, 13th Edition, Pearson Education Limited.
Assessments
Individual assignments 60%
End of term assessment 40%
Class Lecturer
Name:
Contact email:
General Reminders:
1. Remember to bring a HKEA approved calculator with SD (statistics) and REG (regression)
functions (No graphical display), to classes and assessment.
2. Check SOUL course link and class link frequently for updated information about the
course and class management.
3. Correct your answer to 4 decimal places when necessary.
Introduction
What is Statistics?
Statistics is the science that processes and analyzes data in order to produce meaningful
information. Careful use of statistical methods will enable us to obtain accurate information
from data. These methods include (1) carefully defining the situation, (2) gathering data, (3)
accurately summarizing the data, and (4) deriving and communicating meaningful conclusions.
Statistics involves information, methods to summarize this information, and their interpretation.
The field of statistics can be roughly subdivided into two areas: descriptive statistics and
inferential statistics. Descriptive statistics focuses on data collection (e.g. methodology), data
presentation (e.g. charts and tables) and the description of sample data (e.g. average and standard
deviation). Inferential statistics estimates population characteristics by means of
sample statistics (e.g. the average) and uncovers other useful information about the population (e.g. the
relationship between income and expense).
The following examples show that statistics is used in almost every area:
(i) As a business student, you are required to review the sales of a new product, green tea
ice-cream. One way to review the sales is to refer to the number of boxes of green tea
ice-cream sold in a week in supermarkets. Furthermore, you need to compare the sales of
green tea ice-cream with the sales of chocolate ice-cream.
(ii) As a marketing student, you are asked to review the effectiveness of a series of promotions
in a shopping mall. One measurement is the response (like / dislike) of teenagers.
Furthermore, you need to compare the responses of male and female teenagers.
(iii) As an aviation student, you are working on the travel control team in a summer project.
You are asked to check the delay time of flights arriving at Hong Kong International Airport.
Furthermore, you are asked to compare the delay times of flights departing from different
countries.
Gathering Data
There are many ways to gather data. One can collect primary data through observation,
experiments or surveys. We can also collect secondary data by searching existing
information in publications or previous research. In this chapter, we will focus on how
to collect primary data by conducting a survey.
When we talk about collecting data, we usually do not aim at collecting a single piece of data.
Instead, we are talking about large-scale data collection based on our research objective.
A questionnaire is usually designed with a standardized set of questions to be asked. Once the
questionnaire is designed, we need to define who or which is the most suitable person or unit to
respond to it. The subject (also called item or element) is the target information provider for the
specified research objective. Referring to the previous examples, we would define the subject
of each research as:
If data is collected from every subject of the population, this is known as a census. When the
population is small, this can be a straightforward exercise. As the population size becomes
larger, taking a census can be very time-consuming, and when the population is very large
it may not be possible to survey every member at all.
For example, how would you interview every teenager in Hong Kong?
When the data collection process covers less than 100% of the population, it is known as a sample
survey. Sample data can be obtained more cheaply and quickly, and if the sample is a good
representation of the population, a sample survey can give an accurate indication of the
population characteristic being studied.
Let’s make a simple comparison between conducting a census and a survey. Imagine there are 500
supermarkets selling the green tea ice-cream. You are required to collect information about the
number of boxes of green tea ice-cream sold in a supermarket during a week. If you do a census,
you need to visit all 500 supermarkets and record the number of boxes of green tea
ice-cream sold in each of them. By the end of the census you would have prepared a database,
for example as an Excel file, like this:
After the census, data analysis would be conducted in order to answer the research objective.
Simple analysis, such as the calculation of the population mean and population standard deviation,
would be the starting point for the analysis of a numerical data set. (Do you know what it
means if the population mean number of boxes of green tea ice-cream sold is 75 and the
population standard deviation is 18?)
If you cannot do a census for any reason (time limitation, budget problem, …), then you may
do a survey instead. For example, you do a survey with sample size n = 30. In this
case, you will only collect data from 30 supermarkets and keep a record of the collected data
like this:
After doing a survey, we also want to analyse the collected data in order to draw
conclusions about the research objective. However, as the data collection is incomplete, we need
to be very careful when drawing conclusions. The reliability of a conclusion from a
sample survey depends very much on how well the sample represents the population.
Put simply, when doing a census, 100% of the data is collected. You can analyze the data to
explain the situation with no missing information. When a survey is conducted, you have only some
of the information to study, so there may be a difference between the sample statistics and the
population parameters. So when we do a survey, try to:
Sampling Methods
There are many sampling methods, which can be grouped into two categories: random and non-
random.
If no sampling frame exists, then quota sampling may be the only practical method of obtaining
a sample. This method is very widely used in marketing research. First the population is
subdivided into groups in terms of age, sex, income level etc. Then the interviewer is told how
many people to interview within each specified group but is given no specific instructions
about how to locate them and fulfill the quota. It is quick to use, and complications are kept to a
minimum. However, just as with convenience sampling, bias may occur because of the interviewers'
subjective selection process.
In conclusion, random sampling is a fairer way to select a sample. Every subject in the population
should have a chance to be selected so as to avoid bias due to subjective selection. In order to do
random sampling, an up-to-date sampling frame must be prepared and every subject in the
population must be assigned a unique identity number for selection purposes.
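As a minimal sketch of random selection from a sampling frame, the Python snippet below draws a simple random sample of 30 supermarkets out of 500. The ID numbers 1 to 500 and the seed value are assumptions made for illustration only; they are not taken from the survey described in these notes.

```python
import random

# Hypothetical sampling frame: one unique ID number per supermarket (1..500).
sampling_frame = list(range(1, 501))

# A fixed seed (an arbitrary choice) makes the draw reproducible.
rng = random.Random(2021)

# Simple random sampling without replacement: every ID has the same
# chance of being selected, and no ID can be selected twice.
sample_ids = sorted(rng.sample(sampling_frame, k=30))

print(sample_ids)
```

Because the selection is left to the random number generator, the interviewer's personal judgement plays no part in deciding which supermarkets are visited.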
Types of Data
To obtain data we must observe or measure something. This something is known as a variable.
For example, in the introduction section we talk about collecting information about (i) the number
of boxes of green tea ice-cream sold in a supermarket during a week, (ii) response to the
promotion, and (iii) delay time of a flight.
There are two major types of data: quantitative (numerical) and qualitative (non-numerical).
Data
    Quantitative
        Discrete
        Continuous
    Qualitative
(i) The number of boxes of ice-cream sold in a supermarket during a week is a quantitative variable.
Depending on how popular the green tea ice-cream is, the number of boxes
sold in a supermarket during a week can be any non-negative integer: the more popular
the green tea ice-cream, the more boxes are sold during a week.
(ii) There are many ways to measure “response”. As the response to a series of promotion
activities is classified as “like” or “dislike”, this variable is a qualitative variable. Every
interviewee is simply asked to indicate how they feel about the promotion activities by placing
themselves in the “like” group or the “dislike” group. If the overall proportion of “like” responses
is high, then the promotion is a success, while a low proportion of “like” responses means the
promotion has failed to improve the image of the product.
(iii) The delay time of a flight is a quantitative variable. The delay time of a flight can be any
real number. Positive real number means a delay, while negative real number means the flight
arrives earlier than the expected time.
Considering the two examples of quantitative data, two scales of measurement can be further defined.
The weekly number of boxes of ice-cream sold can only take integer values, so
this variable is considered a discrete random variable. The delay time of a flight is measured on
a continuous scale, so it is considered a continuous random variable.
Given a raw set of data, there is often no apparent overall pattern. Perhaps some values are
more frequent, sometimes a few extreme values stand out, and usually the range of values is
noticeable. Presenting data involves such concepts as representative or average values, measure
of dispersion, and positions of various values, all of which fall under the broad topic of descriptive
statistics.
Given below is a sample of the number of boxes of green tea ice-cream sold in a supermarket
during one week. The data is collected from a sample of 30 supermarkets. Below is the
ordered array of the data:
46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109
Mean
The mean of a data set is the average of all the data values.
Data collected from the whole population: \( \mu = \frac{\sum x}{N} \)
Data collected from a sample: \( \bar{x} = \frac{\sum x}{n} \)
46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109
Sample mean \( \bar{x} = \frac{46+49+50+\cdots+109}{30} = 72.0667 \) boxes
Multiplying the mean by the size of the data set gives back the total number:
72.0667 × 30 ≈ 2162 boxes
Remark:
Descriptive Statistics vs. Inferential Statistics
⚫ Descriptive Statistics: The focus is to report the population mean / sample mean obtained in
the census / survey so as to describe the characteristics of the variable.
⚫ Inferential Statistics: The sample mean obtained in the survey is used as an estimate
of the unknown population mean.
Mode
The mode of a data set is the value that occurs with greatest frequency.
46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109
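The mean and mode of the ice-cream sample can be checked with a short Python sketch. It uses only the standard `statistics` module; the variable names are chosen here for illustration.

```python
from statistics import mean, multimode

# Ordered array of weekly boxes sold in the 30 sampled supermarkets.
boxes = [46, 49, 50, 52, 52, 56, 58, 59, 59, 60,
         62, 65, 65, 65, 68, 69, 70, 72, 75, 76,
         80, 82, 85, 88, 94, 96, 99, 99, 102, 109]

sample_mean = mean(boxes)   # sum of all values divided by n = 30
modes = multimode(boxes)    # list of the most frequent value(s)

print(round(sample_mean, 4))  # 72.0667
print(modes)                  # [65]
```

The value 65 appears three times, more often than any other value, so it is the mode of this sample.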
Percentile
The pth percentile of a data set is a value such that at least p percent of the items take on this value
or less and at least (100 − p) percent of the items take on this value or more.
* Data must be arranged in an ordered array. A position in a raw (unordered) data set does not give
any information about percentiles.
Find the 10th percentile, 25th percentile, 50th percentile, 75th percentile, and 88th percentile.
Solution
10th percentile \( = \frac{50+52}{2} = 51 \) boxes (\( i = \frac{10}{100}(30) = 3 \))
25th percentile = 59 boxes (\( i = \frac{25}{100}(30) = 7.5 \uparrow 8 \))
50th percentile \( = \frac{68+69}{2} = 68.5 \) boxes (\( i = \frac{50}{100}(30) = 15 \))
75th percentile = 85 boxes (\( i = \frac{75}{100}(30) = 22.5 \uparrow 23 \))
88th percentile = 99 boxes (\( i = \frac{88}{100}(30) = 26.4 \uparrow 27 \))
➢ The worst 10% of the supermarkets recorded sales of less than 51 boxes of green tea
ice-cream, half of the supermarkets had sales of less than 68.5 boxes and the top 12% of the
supermarkets had sales of more than 99 boxes.
Special cases
25th percentile = First Quartile Q1
50th percentile = Second Quartile Q2 = median
75th percentile = Third Quartile Q3
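The percentile calculations can be sketched in Python as below. The rule coded here is read off the worked solution in these notes (compute i = (p/100)n; if i is a whole number, average the values at positions i and i+1, otherwise round i up); the function name `percentile` is a label chosen for this sketch, and other textbooks and software packages may use slightly different rules.

```python
def percentile(ordered, p):
    """Return the pth percentile (0 < p < 100, p an integer) of an ordered list."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(ordered)
    # whole = integer part of i = (p/100) * n, remainder tells us if i is whole
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is a whole number: average the values at positions i and i + 1
        return (ordered[whole - 1] + ordered[whole]) / 2
    # otherwise round i up to the next position (1-based), i.e. index `whole`
    return ordered[whole]


boxes = [46, 49, 50, 52, 52, 56, 58, 59, 59, 60,
         62, 65, 65, 65, 68, 69, 70, 72, 75, 76,
         80, 82, 85, 88, 94, 96, 99, 99, 102, 109]

for p in (10, 25, 50, 75, 88):
    print(p, percentile(boxes, p))
```

With this rule Q1, Q2 and Q3 are simply `percentile(boxes, 25)`, `percentile(boxes, 50)` and `percentile(boxes, 75)`.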
It is important to determine not only the location of the mean, but also to look at the variation within
the data. Surely you can tell the difference between two classes of students if (a) everyone
gets 76 marks in the examination, so the mean is 76 marks, and (b) the students' marks
vary widely from 18 to 98 marks, with a mean of 80 marks. In general, after reporting the
central location of the data, we continue by reporting the variation among the data. There are
several ways to specify the variation in the data.
Range
It is the difference between the largest and smallest data values.
Range = maximum value – minimum value
46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109
46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109
Q1 = 59 boxes (\( i = \frac{25}{100}(30) = 7.5 \uparrow 8 \))
Q3 = 85 boxes (\( i = \frac{75}{100}(30) = 22.5 \uparrow 23 \))
IQR = Q3 – Q1 = 85 – 59 = 26 boxes
Variance
The variance is the average of the squared differences between each data value and the mean.
Data collected from the whole population: \( \sigma^2 = \frac{\sum (x-\mu)^2}{N} \)
Data collected from a sample: \( s^2 = \frac{\sum (x-\bar{x})^2}{n-1} \)
Remarks:
1. Think about the meaning of \( (x-\mu)^2 \): it defines a new variable which measures the
squared difference between each data point and the mean. The population variance is simply the
average of this new variable. When the variance is small, the difference between each
data point and the mean is small, which means the data points are located closely together.
2. Why does the sample variance use a formula similar to the population variance but with the
denominator equal to n − 1? Because this makes the sample variance a better estimator
of the population variance.
Standard deviation
The standard deviation of a data set is the positive square root of the variance.
It is measured in the same unit as the data, which makes it easier than the
variance to compare with the mean.
The population standard deviation is denoted as σ while the sample standard deviation is denoted
as s.
In practice, we calculate the standard deviation by using the calculator. (Refer to the
appendix)
46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109
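As a cross-check on the calculator, the range, sample variance and sample standard deviation of the ice-cream data can be computed with Python's standard `statistics` module. This is only an illustrative sketch; `variance` and `stdev` divide by n − 1, matching the sample formulas in these notes.

```python
from statistics import stdev, variance

boxes = [46, 49, 50, 52, 52, 56, 58, 59, 59, 60,
         62, 65, 65, 65, 68, 69, 70, 72, 75, 76,
         80, 82, 85, 88, 94, 96, 99, 99, 102, 109]

data_range = max(boxes) - min(boxes)  # maximum value minus minimum value
s2 = variance(boxes)                  # sample variance, denominator n - 1
s = stdev(boxes)                      # positive square root of s2

print(data_range)     # 63
print(round(s2, 4))   # 313.7885
print(round(s, 1))    # 17.7
```

The standard deviation of about 17.7 boxes is in the same unit as the data, so it can be read directly against the mean of about 72 boxes.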
When the relative frequency of a variable at different data values is plotted, the probability density
function of the variable is visualized. A distribution can have many different shapes. We can
classify distributions according to their skewness. A distribution is symmetric if the parts above
and below its center are mirror images in the density function. A distribution is skewed to the
right if the right side is longer, while it is skewed to the left if the left side is longer. In this
course, we use the quartiles and median to describe the skewness of data.
Symmetry: Q2 – Q1 = Q3 – Q2
(Example: height of a 10-year-old boy)
In general, when
Q2 – Q1 = Q3 – Q2 symmetric distribution
Q2 – Q1 > Q3 – Q2 left-skewed distribution
Q2 – Q1 < Q3 – Q2 right-skewed distribution
46 49 50 52 52 56 58 59 59 60
62 65 65 65 68 69 70 72 75 76
80 82 85 88 94 96 99 99 102 109
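Q1 = 59 boxes (\( i = \frac{25}{100}(30) = 7.5 \uparrow 8 \))
Q2 \( = \frac{68+69}{2} = 68.5 \) boxes (\( i = \frac{50}{100}(30) = 15 \))
Q3 = 85 boxes (\( i = \frac{75}{100}(30) = 22.5 \uparrow 23 \))
Q2 – Q1 = 9.5 < Q3 – Q2 = 16.5, so the distribution is right-skewed.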
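The quartile rule for skewness can be written as a small Python function. The function name and the returned labels are choices made for this sketch.

```python
def skew_from_quartiles(q1, q2, q3):
    """Classify the shape of a distribution by comparing Q2 - Q1 with Q3 - Q2."""
    lower = q2 - q1   # spread of the lower half of the box
    upper = q3 - q2   # spread of the upper half of the box
    if lower > upper:
        return "left-skewed"
    if lower < upper:
        return "right-skewed"
    return "symmetric"


# Ice-cream sample: Q1 = 59, Q2 = 68.5, Q3 = 85
print(skew_from_quartiles(59, 68.5, 85))  # right-skewed
```

In real samples the two differences are rarely exactly equal, so a nearly equal pair is usually described as roughly symmetric or only slightly skewed.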
In summary, the number of boxes of green tea ice-cream sold in a supermarket during one
week is a variable. According to the results collected from a sample of 30 supermarkets, the
mean was 72 boxes with a standard deviation of 17.7 boxes. The worst 10% of the supermarkets
recorded sales of less than 51 boxes, half of the supermarkets had sales of less than 68.5
boxes and the top 12% of the supermarkets had sales of more than 99 boxes. The distribution
of the data was right-skewed.
Whenever you collect a set of data, it is useful to plot the distribution. In statistics, there are
several commonly used ways to do so.
Histogram
Histograms are an efficient and common way to describe distributions of continuous variables.
In general, histograms plot the frequency of occurrence of some observation within given fixed
width intervals.
Determining the percentage of women taller than 177 cm means integrating the frequency
distribution (left figure). The same number can be obtained from the cumulative frequency
distribution (right figure) by simply setting a threshold value. Percentiles can also be read directly
from the cumulative frequency distribution, for example, 88th percentile = 177 cm.
Each digit to the left of the vertical line is a stem. The digits on the right of the vertical line are
the leaves associated with the stems. For the first row the stem is 4 and the leaves are 2, 2 and
5. This row represents 42, 42 and 45. This stem and leaf diagram has been created by splitting
each number into two parts in which the tens digit becomes the stem and the units digit the leaf.
Once the data have been ordered into stems and leaves, it is usual to order the leaves in ascending
order.
Box Diagram
It is used to display five features of a set of data, namely the minimum, Q1,
median, Q3, and maximum, on a proper scale (horizontally or vertically). A box diagram depicts
the location of the center, the spread of the data and the distribution of the data. The following
box plot shows the examination marks of 196 students.
[Box plot of examination marks: minimum = 18, Q1 = 71, median = 80, Q3 = 88, maximum = 98]
From the box plot, you can get the following information:
➢ Half of the students score less than 80 marks and half of the students score more than 80
marks.
➢ The range of the scores is 80 marks (98 – 18) and the interquartile range is 17 marks (88
– 71).
➢ The distribution is slightly left skewed.
The advantage of the diagram is that it summarizes all five important features in one graph.
It is especially useful for comparing several distributions. However, unlike a stem and leaf
diagram, it does not show the detail of every single data value.
Sometimes, instead of just analysing the given variable, we are also interested in
analysing a function of it. A simple linear function, which involves multiplication by a constant,
addition of a constant, or both, is often seen in daily applications.
Y = a + bX
X: Number of items sold in a month by a salesperson
Y: Monthly salary, which is calculated with a basic salary of $20000 and an allowance of $30 for each item sold
Y = 20000 + 30X
Once the summary statistics for variable X have been calculated, the summary statistics of variable
Y can be calculated directly, without regenerating the data set, using the following relationships:
Summary statistics      Y = a + bX
Mean                    Mean(Y) = a + b Mean(X)
Percentile              pth(Y) = a + b pth(X)   (for b > 0)
Range                   Range(Y) = |b| Range(X)
IQR                     IQR(Y) = |b| IQR(X)
Standard deviation      SD(Y) = |b| SD(X)
Variance                Variance(Y) = b² Variance(X)
Below is an example with X as the variable which measures the number of items sold by a
salesperson in a month. The summary statistics are generated from a random sample of 15
salespersons. Without reviewing the raw data set, the summary statistics for variable Y, the
monthly salary earned by a salesperson, can be easily generated as follows:
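The relationships for Y = a + bX can be checked numerically. In the Python sketch below, the 15 values of items sold are hypothetical numbers invented purely for illustration (they are not the sample used in these notes); only the salary rule Y = 20000 + 30X comes from the text.

```python
from math import isclose
from statistics import mean, stdev, variance

# Hypothetical monthly sales of 15 salespersons (illustrative values only).
items_sold = [12, 15, 18, 20, 22, 25, 25, 28, 30, 31, 35, 38, 40, 44, 52]

a, b = 20000, 30                              # Y = 20000 + 30X
salary = [a + b * x for x in items_sold]      # regenerate the data set for Y

# Shortcut: summary statistics of Y from the summary statistics of X.
mean_y = a + b * mean(items_sold)
sd_y = abs(b) * stdev(items_sold)
var_y = b ** 2 * variance(items_sold)
range_y = abs(b) * (max(items_sold) - min(items_sold))

# The shortcut agrees with working directly on the regenerated data.
print(isclose(mean_y, mean(salary)))
print(isclose(sd_y, stdev(salary)))
print(isclose(var_y, variance(salary)))
print(range_y == max(salary) - min(salary))
```

The constant a shifts every value by the same amount, so it affects the mean and percentiles but not the measures of spread.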
When calculating a statistical parameter of a data set, it is often necessary to use an intermediary
result (e.g. the mean) during the computation. By including such an estimator in the calculation,
the number of independent scores is reduced, or we say that the degrees of freedom are reduced by
one.
Consider the calculation of the sample variance, which is computed by averaging the
squares of the deviations from the mean value. As the population mean is unknown, it is
estimated by the sample mean.
\( s^2 = \frac{\sum (x-\bar{x})^2}{n-1} \)
Since the average \( \bar{x} \) is computed from all scores x, the number of independent x in the formula
above is reduced from n to n − 1, because you could calculate any one particular score from the
mean and the other (n − 1) scores.
df = n − a
Data Set:
163.6 156.2 166.3 179.3 157.8 165.4 159.5 161.7 160.4
3. Input data
163.6 DT 156.2 DT 166.3 DT 179.3 DT
157.8 DT 165.4 DT 159.5 DT 161.7 DT
160.4 DT
5. Change Data
Example: change the first data ‘163.6’ to ‘183.6’
▲/▼ (until you see x1=163.6) 183.6 EXE
6. Delete Data
Example: delete the second data ‘156.2’
▲/▼ (until you see x2=156.2) SHIFT DT
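Variance
Population: \( \sigma^2 = \frac{\sum (x-\mu)^2}{N} \)    Sample: \( s^2 = \frac{\sum (x-\bar{x})^2}{n-1} \)
Standard deviation
Population: \( \sigma = \sqrt{\frac{\sum (x-\mu)^2}{N}} \)    Sample: \( s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} \)
Skewness
Describe the shape of the distribution as:
If Q2 – Q1 > Q3 – Q2: left skewed
If Q2 – Q1 = Q3 – Q2: symmetric
If Q2 – Q1 < Q3 – Q2: right skewed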
Chapter 3 Probability
Probability is the likelihood or chance that “something” happens. For example, we may
want to know how likely it is that a customer will spend $200 or more in one visit to the supermarket.
An event A, called ‘spending $200 or more’, can be defined. With sufficient information, we may
be able to evaluate the probability that a customer will spend $200 or more in one visit to the
supermarket, for example P(A) = 0.8.
As you should have learnt some probability theory in your previous studies, in this chapter we
just review some of the important concepts.
The sample space is defined as the set of all possible outcomes, usually denoted by S.
An event is a subset of the sample space, which is also the probability statement you want to
evaluate.
Example 1
What is the probability of getting the number “1” when a fair die is tossed?
S = {1, 2, 3, 4, 5, 6}
A = {1}
Example 2
What is the probability that a customer will spend $200 or more in one visit to the supermarket?
S = {x ≥ 0}, where x is the spending in one visit to the supermarket
A = {x ≥ 200}
Compiling Probability
There are two different approaches to compile probabilities: classical probability and empirical
probability.
Classical Probability:
Assuming each sample point has the same chance of happening, the probability of an event
is:
\( P(A) = \frac{n(A)}{n(S)} \)
where n(A) is the number of sample points in event A and n(S) is the number of sample points in the sample space S.
Example 1
What is the probability of getting the number “1” when a fair die is tossed?
S = {1, 2, 3, 4, 5, 6}
A = {1}
\( P(A) = \frac{1}{6} \)
Empirical Probability:
We observe the relative frequency of the event in an actual experiment and use it
as the probability. This type of probability can be used when we study the result of
a survey or records collected in the past.
Example 2
What is the probability that a customer will spend $200 or more in one visit to the supermarket?
S = {x ≥ 0}, where x is the spending in one visit to the supermarket
A = {x ≥ 200}
\( P(A) = \frac{480}{600} = 0.8 \), i.e. the probability that a customer will spend $200 or more in one visit to the
supermarket is 0.8.
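The two approaches can be compared side by side in a short Python sketch. The counts 480 and 600 are read from the fraction in Example 2 and are interpreted here, as an assumption, as 480 customers spending $200 or more out of 600 customers observed.

```python
from fractions import Fraction

# Classical probability: equally likely sample points, P(A) = n(A) / n(S).
sample_space = {1, 2, 3, 4, 5, 6}   # outcomes of tossing a fair die
event = {1}                         # getting the number 1
p_classical = Fraction(len(event), len(sample_space))

# Empirical probability: relative frequency observed in past records.
customers_observed = 600            # assumed meaning of the denominator
spent_200_or_more = 480             # assumed meaning of the numerator
p_empirical = spent_200_or_more / customers_observed

print(p_classical)   # 1/6
print(p_empirical)   # 0.8
```

The classical value follows from the structure of the experiment alone, while the empirical value depends on the data collected and would change if a different set of customers were observed.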
Example 2
As the probability that a customer will spend $200 or more in one visit to the supermarket is 0.8,
the probability that a customer will spend less than $200 in one visit to the
supermarket is 1 – 0.8 = 0.2. A customer is therefore more likely to spend $200 or more than
to spend less than $200 in one visit to the supermarket.
Conditional Probability
When an extra condition is specified before the probability is computed, we say that a conditional
probability is required. For example, calculating the conditional
probabilities below helps us to compare the behavior of two groups of customers.
Example 2
(i) What is the probability that a male customer will spend $200 or more in one visit to the
supermarket?
(ii) What is the probability that a female customer will spend $200 or more in one visit to the
supermarket?
In order to answer these two questions, we need to reorganize the previous information in a
contingency table:
(i) P(A | male) = \( \frac{175}{250} = 0.7 \)
(ii) P(A | female) = \( \frac{305}{350} = 0.8714 \)
➢ The chance that a female customer spends $200 or more in one visit to
the supermarket is higher than that for a male customer.
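A conditional probability is a relative frequency computed within one group only. The Python sketch below rebuilds the counts from the fractions 175/250 and 305/350; the "under $200" counts (75 and 45) are obtained here by subtraction and are therefore an inference, not figures quoted from the original table.

```python
# Contingency table as nested dictionaries: gender -> spending group -> count.
counts = {
    "male":   {"under_200": 250 - 175, "200_or_more": 175},
    "female": {"under_200": 350 - 305, "200_or_more": 305},
}


def conditional(table, group, outcome):
    """P(outcome | group): restrict attention to one group, then take the share."""
    row = table[group]
    return row[outcome] / sum(row.values())


p_male = conditional(counts, "male", "200_or_more")
p_female = conditional(counts, "female", "200_or_more")

print(round(p_male, 4))    # 0.7
print(round(p_female, 4))  # 0.8714
```

The denominator changes from all 600 customers to the 250 male or 350 female customers, which is exactly what the condition "given male" or "given female" means.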
Independent Events
We have just found that the chance for a female customer to spend $200 or more in one visit to
the supermarket is higher than that for a male customer.
As knowing the gender of a customer changes the chance that the customer spends $200
or more, in a statistical sense we say that “the spending in one visit to the supermarket” and “the
gender” are two dependent events.
Only when knowing that one event has happened does not change the chance of the
other event happening are the two events called independent events.
(i) P(spent $200 or more) = \( \frac{480}{600} = 0.8 \)
(ii) P(spent $200 or more | male) = \( \frac{175}{250} = 0.7 \)
(iii) P(spent $200 or more | female) = \( \frac{305}{350} = 0.8714 \)
➢ “The spending in one visit to the supermarket” and “the gender” are dependent variables.
(II) Are the spending on movie watching and the gender independent variables?
                     Male   Female   Total
Spent < $80            55       70     125
Spent $80 or more     165      210     375
Total                 220      280     500
(i) P(spent $80 or more) = \( \frac{375}{500} = 0.75 \)
(ii) P(spent $80 or more | male) = \( \frac{165}{220} = 0.75 \)
(iii) P(spent $80 or more | female) = \( \frac{210}{280} = 0.75 \)
➢ “The spending in one visit to the cinema” and “the gender” are independent variables.
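The independence check used in the two examples amounts to comparing each conditional probability with the overall probability. The Python sketch below uses exact fractions so that equal proportions compare as equal; the function name and table layout are choices made for this illustration, and the supermarket "under $200" counts are inferred by subtraction.

```python
from fractions import Fraction


def is_independent(table, outcome):
    """True if P(outcome | group) equals P(outcome) for every group in the table."""
    grand_total = sum(sum(row.values()) for row in table.values())
    overall = Fraction(sum(row[outcome] for row in table.values()), grand_total)
    return all(
        Fraction(row[outcome], sum(row.values())) == overall
        for row in table.values()
    )


supermarket = {
    "male":   {"under_200": 75, "200_or_more": 175},
    "female": {"under_200": 45, "200_or_more": 305},
}
cinema = {
    "male":   {"under_80": 55, "80_or_more": 165},
    "female": {"under_80": 70, "80_or_more": 210},
}

print(is_independent(supermarket, "200_or_more"))  # False
print(is_independent(cinema, "80_or_more"))        # True
```

Survey data seldom give exactly equal proportions, so in practice a formal test (such as the chi square test listed in the syllabus) is used to judge whether a difference is large enough to matter.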
A quantitative random variable is one whose outcomes are expressed numerically.
Quantitative variables are classified as discrete or continuous. In this chapter, we will look at
how to present the characteristics of a discrete random variable by its probability distribution
function and expectation. We will also look at some special cases in which the discrete random
variable follows a specific distribution, for example the Binomial distribution and the Poisson
distribution. The presentation of a continuous random variable will be discussed in the next
chapter.
Probability Distribution
For example, a telecommunication company wants to collect information about the number of
mobile phones an adult has. It is expected that most adults would have one mobile phone,
while some may have two, three or even four mobile phones. How would we know the proportion of
adults who have one, two, three, or four mobile phones? A simple method is to conduct a survey.
Suppose a survey has been conducted. According to the discussion in Chapter 3, one
can use the relative frequencies in the survey result to estimate the probabilities of different events.
Suppose the following table is the result of a survey which involves a sample of 500 customers,
where X represents the number of mobile phones a customer has:
x 1 2 3 4
Frequency 240 120 80 60
As expected, the biggest group of customers is those with one mobile phone, which has
240 customers out of a total of 500 customers. If you now randomly select one customer and
talk to him, the chance that he has one mobile phone can be estimated by the relative frequency,
\( P(X = 1) = \frac{240}{500} = 0.48 \).
Now try to read the following table, which represents the probability distribution of the number
of mobile phones an adult has, X:
x 1 2 3 4
P(X = x) 0.48 0.24 0.16 0.12
(i) List out all the possible outcomes of the variable, and
(ii) Find the probability of each possible outcome
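Turning the survey frequency table into a probability distribution is a one-line calculation. The Python sketch below uses the mobile phone survey figures from this section; the variable names are chosen for illustration.

```python
# Survey result: number of mobile phones -> number of customers.
frequency = {1: 240, 2: 120, 3: 80, 4: 60}

n = sum(frequency.values())   # 500 customers in total

# Relative frequency of each outcome, used as its probability.
distribution = {x: f / n for x, f in frequency.items()}

for x, p in distribution.items():
    print(x, p)

# A probability distribution must cover all outcomes, so it adds up to 1.
print(round(sum(distribution.values()), 10))
```

The dictionary lists every possible outcome together with its probability, which are the two steps (i) and (ii) described above.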
Example 1
x         1     2     3     4     5     6
P(X=x)    1/6   1/6   1/6   1/6   1/6   1/6
Example 2
“MovieClub” is an online platform which recruits movie lovers to share their movie watching
experience. The following is the probability distribution function of the number of visits to the
cinema made by a “MovieClub” member in a month
x 1 2 3 4 5 6
P(X=x) 0.05 0.23 0.35 0.23 0.13 0.01
Expected value of X
We already have the idea that most likely an adult would have 1 mobile phone, while an adult
can have as many as four mobile phones.
x 1 2 3 4
P(X = x) 0.48 0.24 0.16 0.12
Is there any way we can calculate the expected (average) number of mobile phones an adult has?
We learnt the concept of mean / average in Chapter 2. For a data set, the mean is calculated
as \( \frac{\text{accumulated total}}{\text{number of data}} \). If we use the survey result to calculate the mean, then among the 500
customers there are 960 mobile phones, so the average is 1.92:
\( \frac{1(240)+2(120)+3(80)+4(60)}{240+120+80+60} = 1.92 \)
If you take a closer look at this calculation, you will notice that the number of customers is not
really important. If you replace the frequency by the relative frequency (probability), then the
calculation of the mean would be:
1(0.48) + 2(0.24) + 3(0.16) + 4(0.12) = 1.92
This is the mean of X or expected value of X. The expected value of X is usually written as E(X)
and sometimes as μ. In general, for a discrete random variable X:
\( \mu = E(X) = \sum x P(X = x) \)
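The expected value formula is a probability-weighted sum, which the following Python sketch makes explicit. The helper name `expected_value` is chosen for this illustration; the two distributions are the mobile phone and “MovieClub” tables given in this chapter.

```python
def expected_value(dist):
    """E(X): sum of each value x multiplied by its probability P(X = x)."""
    return sum(x * p for x, p in dist.items())


phones = {1: 0.48, 2: 0.24, 3: 0.16, 4: 0.12}
cinema_visits = {1: 0.05, 2: 0.23, 3: 0.35, 4: 0.23, 5: 0.13, 6: 0.01}

print(round(expected_value(phones), 4))         # 1.92
print(round(expected_value(cinema_visits), 4))  # 3.19
```

The expected value need not be a value that X can actually take: nobody owns 1.92 phones, but that is the long-run average per adult.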
Variance of X
The concept of variance is usually more difficult to understand. How do you interpret
the variance of X? When we say that on average an adult has 1.92 mobile phones, does it
really mean everyone has 1.92 mobile phones? Of course not! There must be a difference
between the actual value of X and the expected value of X. Variance is the measurement of
the average squared difference between the data points and the mean:
\( Var(X) = \frac{\sum (x-\mu)^2}{N} \)
which can be simplified as
\( Var(X) = \frac{\sum x^2}{N} - \mu^2 \)
So for our example:
Method 1
Use the data set of 500 data values and define a new variable (X – μ)²; the variance is the average of this
new variable.
x           1             2             3             4
(x – μ)²    (1 – 1.92)²   (2 – 1.92)²   (3 – 1.92)²   (4 – 1.92)²
Frequency   240           120           80            60
Var(X) = [(1 – 1.92)²(240) + (2 – 1.92)²(120) + (3 – 1.92)²(80) + (4 – 1.92)²(60)] / 500 = 1.1136
Method 2
Consider X² as a function of X, calculate the values of X² and use the simplified formula to do the
calculation:
x           1      2      3      4
x²          1²     2²     3²     4²
P(X = x)    0.48   0.24   0.16   0.12
Var(X) = 1²(0.48) + 2²(0.24) + 3²(0.16) + 4²(0.12) – 1.92² = 1.1136
Standard deviation of X
The positive square root of the variance gives the standard deviation of X.
σ(X) = √1.1136=1.0553
➢ In summary, an adult has an average of 1.92 mobile phones with a standard deviation of
1.0553 mobile phones.
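Both methods for the variance can be reproduced in a few lines of Python. This is a sketch for checking the arithmetic; the variable names are chosen for illustration.

```python
from math import sqrt

frequency = {1: 240, 2: 120, 3: 80, 4: 60}    # survey of 500 customers
n = sum(frequency.values())
mu = sum(x * f for x, f in frequency.items()) / n   # 960 / 500

# Method 1: average of the squared differences (x - mu)^2 over all 500 data.
var_method1 = sum(f * (x - mu) ** 2 for x, f in frequency.items()) / n

# Method 2: simplified formula, E(X^2) minus mu^2, using probabilities.
prob = {x: f / n for x, f in frequency.items()}
var_method2 = sum(x ** 2 * p for x, p in prob.items()) - mu ** 2

sd = sqrt(var_method2)   # standard deviation: positive square root of variance

print(round(var_method1, 4))  # 1.1136
print(round(var_method2, 4))  # 1.1136
print(round(sd, 4))           # 1.0553
```

The two methods are algebraically identical, so any difference between them in practice comes only from rounding.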
Example 2
The following is the probability distribution function of the number of visits to the cinema by a
“MovieClub” member in a month
x 1 2 3 4 5 6
P(X=x) 0.05 0.23 0.35 0.23 0.13 0.01
What are the expectation and standard deviation of the number of visits to the cinema made by a
“MovieClub” member in a month?
Solution
E(X) = 1(0.05) + 2(0.23) + 3(0.35) + 4(0.23) + 5(0.13) + 6(0.01) = 3.19
Var(X) = 1²(0.05) + 2²(0.23) + 3²(0.35) + 4²(0.23) + 5²(0.13) + 6²(0.01) – 3.19² = 1.2339
σ(X) = √1.2339 = 1.1108
➢ On average, a “MovieClub” member makes 3.19 visits to the cinema in a
month, with a standard deviation of 1.1108 visits.
Function of X
In general, if Y = f(X) is any function of the discrete random variable X then, by substitution, the
probability distribution function of Y can be generated by transforming each possible value
x to its corresponding value y.
For example, given the probability distribution of the number of mobile phones an adult has, X:
x 1 2 3 4
P(X= x) 0.48 0.24 0.16 0.12
Suppose Y is the monthly spending on mobile service and assume Y = 150X; then the probability
distribution of the monthly spending on mobile service is as follows:
y 150 300 450 600
P(Y = y) 0.48 0.24 0.16 0.12
With the probability distribution of Y is generated, the expectation, variance, and standard
deviation of Y can be calculated:
E(Y) = 150(0.48) + 300(0.24) + (450)(0.16) + (600)(0.12) = $288
Var(Y) = 1502(0.48) + 3002(0.24) + (450)2(0.16) + (600)2(0.12) - 2882 = 25056
(Y) = √25056 = $158.29
➢ In summary, an adult spends an average of $288 for mobile service with standard
deviation of $158.29.
For a linear function Y = a + bX:
E(Y) = a + bE(X)
Var(Y) = b²Var(X)
σ(Y) = |b|σ(X)
For Y = 150X (a = 0, b = 150),
E(Y) = 150E(X) = 150(1.92) = $288
Var(Y) = 150²Var(X) = 150²(1.1136) = 25056
σ(Y) = 150σ(X) = $158.29
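As a hedged check, the sketch below computes the summary of Y = 150X in two ways: directly from the transformed distribution, and with the linear-function shortcut. The helper `mean_var` and all variable names are assumptions made for illustration only.

```python
from math import sqrt

pmf_x = {1: 0.48, 2: 0.24, 3: 0.16, 4: 0.12}  # number of mobile phones
a, b = 0, 150                                  # Y = a + bX

def mean_var(pmf):
    """Return (mean, variance) of a discrete distribution given as {value: probability}."""
    mu = sum(v * p for v, p in pmf.items())
    var = sum(v**2 * p for v, p in pmf.items()) - mu**2
    return mu, var

# Route 1: transform each x to y = a + bx, keeping its probability
pmf_y = {a + b * x: p for x, p in pmf_x.items()}
mean_y_direct, var_y_direct = mean_var(pmf_y)

# Route 2: apply E(Y) = a + bE(X), Var(Y) = b^2 Var(X), sd(Y) = |b| sd(X)
mean_x, var_x = mean_var(pmf_x)
mean_y_rule = a + b * mean_x
var_y_rule = b**2 * var_x
sd_y_rule = abs(b) * sqrt(var_x)

print(round(mean_y_direct, 2), round(mean_y_rule, 2))
print(round(var_y_direct, 2), round(var_y_rule, 2))
print(round(sd_y_rule, 2))
```

Both routes should agree up to floating-point rounding; the shortcut is convenient because it needs only E(X) and Var(X), not the full distribution of Y.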
Suppose another survey is conducted and Y is the number of tablet devices an adult has, with
probability distribution function:
y            0      1      2      3
P(Y = y)     0.25   0.55   0.15   0.05
If we are now interested in the total number of electronic devices (mobile phones plus tablet
devices) an adult has, a new variable T is defined, where T = X + Y. The detailed probability
distribution function of T cannot be found easily. However, by assuming X and Y are independent,
the summary of T can be generated as:
E(T) = E(X) + E(Y) = 1.92 + 1.00 = 2.92
Var(T) = Var(X) + Var(Y) = 1.1136 + 0.60 = 1.7136
σ(T) = √1.7136 = 1.3090
➢ On average, an adult has 2.92 electronic devices, with a standard deviation of about 1.3 devices.
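Although the distribution of T is tedious to work out by hand, a short script can enumerate it under the same independence assumption. The sketch below is illustrative only (names such as `pmf_t` and `mean_var` are not from the notes): it multiplies the probabilities of every (x, y) pair and collects them by total.

```python
from collections import defaultdict
from itertools import product
from math import sqrt

pmf_x = {1: 0.48, 2: 0.24, 3: 0.16, 4: 0.12}  # mobile phones
pmf_y = {0: 0.25, 1: 0.55, 2: 0.15, 3: 0.05}  # tablet devices

# Under independence, P(X = x and Y = y) = P(X = x) * P(Y = y)
pmf_t = defaultdict(float)
for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
    pmf_t[x + y] += px * py

def mean_var(pmf):
    """Return (mean, variance) of a discrete distribution given as {value: probability}."""
    mu = sum(v * p for v, p in pmf.items())
    var = sum(v**2 * p for v, p in pmf.items()) - mu**2
    return mu, var

mean_t, var_t = mean_var(pmf_t)
sd_t = sqrt(var_t)

for t in sorted(pmf_t):
    print(t, round(pmf_t[t], 4))
print(round(mean_t, 4), round(var_t, 4), round(sd_t, 4))
```

The mean and variance obtained from the enumerated distribution should match E(X) + E(Y) and Var(X) + Var(Y); the variance rule would not hold in general if X and Y were dependent.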
Binomial Distribution
In many situations, an experiment has (or can be reduced to) only two possible outcomes: one
outcome is denoted as success and the other, naturally, as failure. For example, there is an 80%
chance that a customer will spend $200 or more in one visit to the supermarket and a 20% chance
that a customer will spend less than $200 (example in Chapter 3).
When a series of identical experiments is repeatedly observed, our interest is usually the total
number of successes among the n independent, identical trials.
The binomial distribution is used to summarize / predict the outcome of repeated observations of
the identical experiment.
Example 3
(a) Suppose there are 2 customers in the queue.
(i) How many of them may spend $200 or more?
(ii) What is the probability that exactly 1 of them spends $200 or more?
If X is the number of successes in n independent trials, each with success probability p, then X
follows a binomial distribution, written
X ~ Bin(n, p)
and its probability distribution is given as:
P(X = x) = nCx (p)^x (1 − p)^(n − x), x = 0, 1, 2, ..., n
Example 3
(a) (i) When there are 2 customers in the queue, with X denoting the number of customers who
spend $200 or more, x can be 0, 1, or 2.
(ii) There are two different situations in which exactly one of them spends $200 or more:
P(exactly one spends $200 or more)
= P(the first one spends $200 or more and the second one spends less than $200)
+ P(the first one spends less than $200 and the second one spends $200 or more)
= 0.8(0.2) + (0.2)(0.8) = 0.32
(Do you remember how to construct the two-level tree diagram?)
In a Binomial sense, X ~ Bin(2, 0.8)
P(X = 1) = 2C1(0.8)(0.2) = 0.32
(b) (i) Suppose there are 7 customers in the queue. Denote X as the number of customers who
spend $200 or more, where x can be 0, 1, 2, 3, 4, 5, 6, or 7.
(ii) There are many ways (do you know how many?) in which exactly 5 customers spend $200
or more.
In (b), the variable X is the number of customers in the queue who spend $200 or more.
As we know
(i) there are 7 customers in the queue, and
(ii) the chance for each customer to spend $200 or more is 0.8,
the variable X follows a Binomial distribution, X ~ Bin(7, 0.8):
x            0         1        2        3        4        5        6        7
P(X = x)     0.00001   0.0004   0.0043   0.0287   0.1147   0.2753   0.3670   0.2097
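The table for X ~ Bin(7, 0.8) can be reproduced with the binomial formula. The following is a minimal sketch; the function name `binomial_pmf` is an illustrative choice, and `math.comb` (Python 3.8 or later) supplies nCx.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p): nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 7, 0.8  # 7 customers, each spending $200 or more with probability 0.8
table = {x: binomial_pmf(x, n, p) for x in range(n + 1)}

for x, prob in table.items():
    print(x, round(prob, 5))

# Number of ways in which exactly 5 of the 7 customers spend $200 or more
ways_for_five = comb(7, 5)
print(ways_for_five)
```

The value `comb(7, 5)` answers the question posed in (b)(ii): it counts the different arrangements in which exactly 5 of the 7 customers are "successes".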
Imagine we repeatedly take groups of 7 customers and record the number of customers in each
group who spend $200 or more. The average number of customers who spend $200 or more in a
group of 7 customers is the expected value of X.
For X ~ Bin(n, p); the expectation, variance, and standard deviation of X are as follows:
E(X) = np
Var(X) = np(1-p)
σ(X) = √𝑛𝑝(1 − 𝑝)
Example 3(b)
What are the expectation and standard deviation of the number of customers who spend $200 or
more, over many groups of 7 customers?
Solution
As there are 7 customers in each group and the chance of spending $200 or more is 0.8 for each
customer, the number of customers who spend $200 or more in a group follows a Binomial
distribution, X ~ Bin(7, 0.8).
E(X) = 7(0.8) = 5.6 customers
Var(X) = 7(0.8)(0.2) = 1.12
σ(X) = √1.12 = 1.0583 customers
➢ Over many groups of 7 customers, on average 5.6 out of 7 customers spend $200 or
more, with a standard deviation of 1.0583 customers.
Remark:
The expectation of a Binomial variable gives some insight into the most likely number of
successes in a group. In our example, with 7 customers in the queue, the expected number of
customers who spend $200 or more is 5.6, which means that most likely around 5 or 6 customers
spend $200 or more. The standard deviation helps to extend our prediction to a range that covers
the outcomes with relatively high probability.
Continuing the earlier discussion of linear functions (Y = a + bX), applying a linear function to
a Binomial variable extends the range of applications.
Example 4
Johnny has joined a training program in an elderly center. However, he does not go to the
center every day. Based on his past record, the chance that he goes to the elderly center on a
particular day is 0.75, and whether he goes on one day is independent of the other days.
Every day he goes to the center, he calls the center to arrange transportation, and the
traveling fee is $15 per day.
(a) On average, how many days will he go to the elderly center in a week (Monday to Friday)?
(b) What is the probability that he will go on exactly 4 days in a week?
(c) On average, how much is his traveling fee to the center in a week?
Solution
(a) Let X denote the number of days Johnny goes to the elderly center in a week
(Monday to Friday). As there are five days in the week and the chance he goes on any day
is 0.75,
X ~ Bin(5, 0.75)
E(X) = 5(0.75) = 3.75 days
(b) P(X = 4) = 5C4(0.75)⁴(0.25)¹ = 0.3955
(c) Let Y denote the traveling fee in a week, Y = 15X
E(Y) = 15E(X) = 15(3.75) = $56.25
Useful formulae
For T = X + Y (X and Y independent)
Expectation E(T) = E(X) + E(Y)
Variance Var(T) = Var(X) + Var(Y)
Standard deviation σ(T) = √(Var(X) + Var(Y))
For X ~ Bin(n, p)
Expectation E(X) = np
Variance Var(X) = np(1 − p)
Standard deviation σ(X) = √(np(1 − p))
For a continuous random variable, possible outcomes take values from a continuous spectrum.
For example, think about how long a flight is delayed compared with its expected arrival
time. The delay can be any real number.
As a continuous random variable takes values from a continuous spectrum, there are theoretically
infinitely many possible outcomes. Unlike a discrete random variable, for which a specific
probability can be assigned to each outcome, a probability density function is used to describe
the relative likelihood at a specific location. A graphical presentation of the probability
density function often makes its characteristics easier to see. Here are some examples of
continuous random variables:
(a) This graph shows the time a baby needs to finish a simple task in a regular body check.
From the graph, you can see that a baby takes 1 to 5 minutes to finish the task. Unlike a
discrete random variable, there are infinitely many possibilities between 1 and 5 minutes. A
horizontal probability density function means that a baby is equally likely to finish the task
at every possible time between 1 and 5 minutes.
(b) This graph shows the time a student spends on revision in a week. This random variable
takes any value greater than 0, and the curve shows a decreasing (exponential decay) pattern.
It shows that most students do not spend much time on revision.
The Normal curve is bell-shaped and symmetric about a vertical line through the mean μ. We
usually use the notation
X ~ N(μ, σ²)
Revision
There are a few concepts related to the normal distribution that you should have learnt in your
previous study. Let’s review them before we move on.
A continuous random variable X that follows a normal distribution with mean µ and variance σ²
(standard deviation σ) is commonly denoted as X ~ N(µ, σ²).
Example 1
In a cafe, the spending of a customer on a cup of coffee, X, is known to follow a normal
distribution with mean $50 and standard deviation $10, X ~ N(50, 10²).
That means the spending on a cup of coffee is a variable: some customers spend more and some
spend less. The mean spending is known to be $50 and the standard deviation is $10.
As it is a normal distribution, we also know that
Besides knowing the above basic information, can we do further analysis, such as
(a) What is the probability that a customer spends more than $53 for a cup of coffee?
(b) What is the value of k if 20% of the customers would spend less than $k for a cup of
coffee?
The standardized normal variable follows a normal distribution with mean 0 and standard
deviation 1, which is commonly denoted as Z ~ N(0, 1²). This variable Z is any normal
variable after each data point is transformed to its standard score with the formula
Z = (X − µ)/σ, where µ is the mean and σ is the standard deviation of the original variable.
For X ~ N(µ, σ²) with Z = (X − µ)/σ, Z ~ N(0, 1²).
The entries in Table I are the probabilities that a random variable having the standard normal
distribution will take on a value between 0 and z. They are given by the area of the gray
region under the curve in the figure.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4648 0.4656 0.4664 0.4671 0.4678 0.4685 0.4692 0.4699 0.4706
1.9 0.4713 0.4719 0.4725 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
Also, for z = 4.0, 5.0 and 6.0, the areas are 0.49997, 0.4999997, and 0.499999999.
Below are the top few rows of the standard normal table. Let’s see how to use the table to read
off probabilities related to z = 0.32.
[Figure: standard normal curve with the area between 0 and 0.32 shaded]
The entries in Table I are the probabilities that a random variable having the standard normal
distribution will take on a value between 0 and z. They are given by the area of the gray
region under the curve in the figure.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
➢ P(0 < Z < 0.32) = 0.1255 (12.55% of the data take values between 0 and 0.32)
➢ P(-0.32 < Z < 0) = 0.1255 (by symmetry, 12.55% of the data take values between -0.32 and 0)
➢ P(Z > 0.32) = 0.5 − 0.1255 = 0.3745 (37.45% of the data have values greater than 0.32)
➢ P(Z < 0.32) = 0.5 + 0.1255 = 0.6255 (62.55% of the data have values less than 0.32)
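For readers who want to check a table entry, the sketch below reproduces these four probabilities with `statistics.NormalDist` from the Python standard library. Using software in place of the printed table is an assumption of this sketch, and `table_area` is an illustrative helper name.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal variable

def table_area(z):
    """Area between 0 and z (z >= 0), i.e. the quantity tabulated in Table I."""
    return Z.cdf(z) - 0.5

area = table_area(0.32)   # P(0 < Z < 0.32), same as P(-0.32 < Z < 0) by symmetry
p_above = 0.5 - area      # P(Z > 0.32)
p_below = 0.5 + area      # P(Z < 0.32)

print(round(area, 4), round(p_above, 4), round(p_below, 4))
```

The helper subtracts 0.5 because the cumulative function measures the area from minus infinity, whereas Table I measures the area from 0.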
How does the standard normal table help us with the probabilities of other normal variables?
For example, how can we find the probability that a customer spends more than $53 on a cup of
coffee, if the spending follows a normal distribution with mean $50 and standard deviation $10?
We can standardize any normal variable X by subtracting the mean μ and then dividing by the
standard deviation σ. This gives the standard Normal variable Z.
Z = (X − μ)/σ
The value z, which is called the standard normal score, measures how far the data point is from
the mean, using the standard deviation as the unit of measurement.
If you want to calculate a probability for a normal variable, you may follow this
procedure:
Example 1
In a cafe, the spending of a customer on a cup of coffee, X, is known to follow a normal
distribution with mean $50 and standard deviation $10, X ~ N(50, 10²). What is the probability
that a customer spends more than $53 on a cup of coffee?
Solution
To find the probability that a customer spends more than $53 on a cup of coffee, the variable X,
spending on a cup of coffee, is standardized with the function Z = (X − 50)/10.
[Figure: normal curve of X with mean 50 and the area above 53 shaded; on the Z scale, mean 0 and
the area above 0.30 shaded]
P(X > 53) = P(Z > (53 − 50)/10) = P(Z > 0.3) = 0.5 − 0.1179 = 0.3821
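The standardization step can be mirrored in code. This is a minimal sketch that uses `statistics.NormalDist` instead of the printed table (an assumption of the sketch); it computes the probability both through the z-score and directly from X, to show that the two routes agree.

```python
from statistics import NormalDist

X = NormalDist(mu=50, sigma=10)  # spending on a cup of coffee

# Route 1: standardize, then use the standard normal distribution
z = (53 - X.mean) / X.stdev
p_via_z = 1 - NormalDist().cdf(z)   # P(Z > 0.3)

# Route 2: ask the original distribution directly
p_direct = 1 - X.cdf(53)            # P(X > 53)

print(round(z, 2), round(p_via_z, 4), round(p_direct, 4))
```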
By reversing the previous procedure, we can locate the normal score in a normal distribution that
fulfills a specific probability requirement.
1. Locate the unknown normal score (k) reasonably on the normal curve. Make sure you are
aware of whether the normal score should be smaller or bigger than the mean.
3. Rewrite the probability statement for variable Z and find the value of a from the standard
normal table.
P(a < Z < 0) where a should be negative
or P(0 < Z < a) where a should be positive
4. Transform a back to k:
k = µ + aσ
Example 1
In a cafe, the spending of a customer on a cup of coffee, X, is known to follow a normal
distribution with mean $50 and standard deviation $10, X ~ N(50, 10²). The manager wants to
know the value of k such that 20% of the customers spend less than $k on a cup of coffee.
Solution
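A minimal numerical sketch of this reverse lookup is given below. It assumes `statistics.NormalDist.inv_cdf` is used in place of the printed table, so the z-score is more precise than a table reading; with the table, the entry closest to an area of 0.3000 is 0.2995 at z = 0.84, which would give a = −0.84 and k = 50 + (−0.84)(10) = $41.6.

```python
from statistics import NormalDist

mu, sigma = 50, 10   # X ~ N(50, 10^2), spending on a cup of coffee

# 20% of customers spend less than $k, so k lies below the mean and a is negative
a = NormalDist().inv_cdf(0.20)   # z-score with 20% of the area to its left
k = mu + a * sigma               # transform the z-score back to the original scale

# Check: the probability of spending less than k should be 0.20
check = NormalDist(mu, sigma).cdf(k)

print(round(a, 4), round(k, 2), round(check, 4))
```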
Example 2
Besides coffee, the cafe also sells sliced cakes. It is known that the spending on
a piece of cake follows a normal distribution with a mean of $32 and a standard deviation of $6.
(a) What is the probability that a customer spends less than $30 on a piece of cake?
(b) If 60% of customers spend more than $k on a piece of cake, what is the value of k?
Solution
Let Y denote the spending on a piece of cake, Y ~ N(32, 6²).
[Figures: (a) normal curve with the area Y < 30 shaded; (b) normal curve with the 60% of spending
above $k shaded]
(a) P(Y < 30) = P(Z < (30 − 32)/6) = P(Z < −0.33) = 0.5 − 0.1293 = 0.3707
Suppose variable X follows a normal distribution with known µ and σ, X ~ N(µ, σ²).
A variable Y that is a linear function of X, expressed as Y = a + bX, also follows a normal
distribution such that
Y ~ N(a + bµ, (bσ)²)
Example 1
In the cafe, the spending on a cup of coffee follows a normal distribution with a mean of $50 and
a standard deviation of $10, X ~ N(50, 10²). Suppose the owner of the cafe is considering
adjusting the selling price of each cup of coffee by marking up the original price by 8% and then
applying a discount of $2.
(a) What are the (i) mean and (ii) standard deviation of the selling price of a cup of coffee after
the adjustment?
(b) After the adjustment, what is the probability that someone buys a coffee which costs $54 or
more?
Solution
(a) With X denoting the original price of a cup of coffee and Y the adjusted price,
Y = 1.08X – 2
(i) Mean of Y = 1.08E(X) – 2 = 1.08(50) – 2 = $52
(ii) Standard deviation of Y = 1.08σ(X) = 1.08(10) = $10.8
(b) P(Y ≥ 54) = P(Z ≥ (54 − 52)/10.8) = P(Z ≥ 0.19) = 0.5 – 0.0753 = 0.4247
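As an illustrative check, `statistics.NormalDist` supports multiplication and subtraction by a constant, which corresponds to the rule Y ~ N(a + bµ, (bσ)²). The sketch below is an assumption-laden aid, not part of the original solution; it also shows that rounding z to 0.19 for the table lookup changes the answer slightly compared with the unrounded z.

```python
from statistics import NormalDist

X = NormalDist(mu=50, sigma=10)   # original price of a cup of coffee
Y = X * 1.08 - 2                  # adjusted price, Y = 1.08X - 2

# Probability with z rounded to 2 decimal places, as when reading the table
z_rounded = round((54 - Y.mean) / Y.stdev, 2)
p_table_style = 1 - NormalDist().cdf(z_rounded)

# Probability without rounding the z-score
p_exact = 1 - Y.cdf(54)

print(round(Y.mean, 2), round(Y.stdev, 2))
print(z_rounded, round(p_table_style, 4), round(p_exact, 4))
```

The small gap between the two probabilities comes only from rounding the z-score to two decimal places before the lookup.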
The sum of two or more independent Normal variables is also Normally distributed. For two
independent Normal variables such that
X1 ~ N(μ1, σ1²) and X2 ~ N(μ2, σ2²), then:
X1 + X2 ~ N(μ1 + μ2, σ1² + σ2²)
Example 3
In the cafe, the spending on a cup of coffee follows a normal distribution with a mean of $50 and
a standard deviation of $10, X ~ N(50, 10²). It is also known that the spending on a piece of cake
follows a normal distribution with a mean of $32 and a standard deviation of $6, Y ~ N(32, 6²).
Suppose the spending on a cup of coffee and the spending on a piece of cake are independent.
Imagine many customers each buy one cup of coffee and one piece of cake, and you want to review
the total spending of a customer:
(i) What is the distribution of the total spending?
(ii) What is the probability that a customer spends more than $80 when buying a cup of coffee
and a piece of cake?
Solution
(i) Let T be the total spending, T = X + Y
T ~ N(50 + 32, 10² + 6²);
T ~ N(82, 11.6619²)
On average, a customer spends $82 on a cup of coffee and a piece of cake, with a
standard deviation of $11.6619.
(ii) [Figure: normal curve of T with the area T > 80 shaded]
P(T > 80) = P(Z > (80 − 82)/11.6619) = P(Z > −0.17) = 0.5 + 0.0675 = 0.5675
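The sum rule can also be checked with `statistics.NormalDist`, whose `+` operator between two instances treats them as independent normal variables. This is a hedged sketch only; as in the worked solution, the z-score is rounded to two decimal places for the table-style answer, so the unrounded probability differs slightly.

```python
from statistics import NormalDist

X = NormalDist(mu=50, sigma=10)  # spending on a cup of coffee
Y = NormalDist(mu=32, sigma=6)   # spending on a piece of cake
T = X + Y                        # total spending, assuming X and Y are independent

# Table-style answer: round z to 2 decimal places first
z_rounded = round((80 - T.mean) / T.stdev, 2)
p_table_style = 1 - NormalDist().cdf(z_rounded)

# Answer without rounding the z-score
p_exact = 1 - T.cdf(80)

print(round(T.mean, 2), round(T.stdev, 4))
print(z_rounded, round(p_table_style, 4), round(p_exact, 4))
```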
Chapter 6
Sampling Distributions and Central Limit Theorem
In Chapters 1 and 2, we discussed the difference between conducting a census and conducting a
survey, and how to calculate the mean as a summary of the characteristics of a data set. In this
chapter, we go further to understand the relationship between the population mean and the sample
mean, and connect them through the sampling distribution.
The objective of studying the sampling distribution is to build a foundation for the discussion of
inferential statistics.
Let’s use a simple example to see why the sample mean is a random variable.
When students are randomly assigned to different classes, each of size 30, the average result of
each class can be calculated:
Class 1: 28, 32, …, 95, 97    mean result = (28 + 32 + ⋯ + 95 + 97)/30 = 68.3
Class 2: 33, 35, …, 96, 98    mean result = (33 + 35 + ⋯ + 96 + 98)/30 = 74.4
Class 3: 30, 31, …, 88, 91    mean result = (30 + 31 + ⋯ + 88 + 91)/30 = 72.2
Selecting one class of students and reviewing the class mean is the same idea as selecting one
sample and looking at the sample mean. The above example shows that the sample mean is not a
single fixed value; it is a variable.
If the sample mean is a random variable, what are the characteristics of this random variable? Is
it discrete or continuous? What are the mean and standard deviation of this random variable?
Sampling distribution
Suppose many different samples of the same size are obtained by repeatedly sampling from a
population with population mean μ and population standard deviation σ. For each sample:
the sample mean x̄ is calculated; and
a histogram of these sample means is drawn.
[Figure: Samples 1, 2, 3, … drawn from variable X (population mean μ, population variance σ²),
giving sample means x̄1, x̄2, x̄3, …]
The characteristics of the distribution function of the sample mean can then be summarized as
follows:
1. The mean of the sample means equals the population mean.
E(X̄) = μ
2. The variance of the sample means equals the population variance divided by the sample size.
Var(X̄) = σ²/n
3. The standard error, the positive square root of the variance, is a measurement used to
represent the average deviation of an individual sample mean from the population mean.
SE(X̄) = σ/√n
If the sample size is reasonably large (n ≥ 30), the distribution of the sample mean is well
approximated by a normal distribution (Central Limit Theorem):
X̄ ~ N(μ, σ²/n)
With the requirement (i) or (ii) (or both) fulfilled, further analysis by using the normal variable
characteristics can be conducted.
Example 1
As mentioned earlier, suppose the examination results of General Statistics are as follows:
Population mean: 71.65
Population standard deviation: 14.29
When every 30 students are randomly assigned to a class, the sampling distribution of the class
mean is as follows:
Mean of class mean: 71.65
Variance of class mean: 14.29²/30 = 6.8068
Standard error of class mean: 14.29/√30 = 2.6090
Because the sample size is reasonably large, X̄ ~ N(71.65, 2.6090²)
➢ If you compare the performance of individual students, the mean score is 71.65 and
the standard deviation is 14.29. However, if you compare the performance of different
classes (using the class mean to represent the performance of the class), the average
of the class mean scores is 71.65 and the standard deviation of the class mean scores is 2.61.
It is not a surprise that a comparison between classes is more stable than a comparison
between students, as each class has some students who perform well and some who do not.
The class mean balances the high marks against the low marks.
Example 2
A report indicates that on average a tourist spends $5000 on a 3-day trip to Taiwan. The
standard deviation of the spending is $600, so the variance is 360000 ($²). Imagine you are a
tour guide and you take care of a group (sample) of 40 tourists every day. If you keep a
long-term record of the mean spending of each group of 40 tourists, you should be aware that
the mean spending in each group is not constant, but a variable. Let X denote the spending of
an individual and X̄ denote the mean spending of a sample of 40 tourists; then
E(X̄) = $5000
Var(X̄) = 360000/40 = 9000
SE(X̄) = 600/√40 = $94.8683
and, as the sample size is reasonably large, X̄ ~ N(5000, 94.8683²).
Example 3
As in our example in Chapter 5, the spending on a cup of coffee is a normal variable, with
population mean $50 and standard deviation $10. If repeated random samples of size 30
are selected, then the sample mean spending X̄ is a random variable where
Mean of sample mean: E(X̄) = µ = $50
Variance of sample mean: Var(X̄) = Var(X)/n = 100/30 = 3.3333
Standard error of sample mean: SE(X̄) = σ/√n = 10/√30 = $1.8257
Because the sample size 30 is large enough, X̄ ~ N(50, 1.8257²)
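The idea of repeated sampling can be imitated by simulation. The sketch below is illustrative only: the seed, the number of repetitions (5000) and all variable names are arbitrary assumptions. It draws many samples of size 30 from N(50, 10²), records each sample mean, and compares the spread of those means with the theoretical standard error σ/√n.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

mu, sigma, n = 50, 10, 30        # population mean, population sd, sample size
se_theory = sigma / sqrt(n)      # theoretical standard error of the sample mean

rng = random.Random(2021)        # fixed seed so the run is repeatable
sample_means = [
    fmean(rng.gauss(mu, sigma) for _ in range(n))   # mean of one sample of size n
    for _ in range(5000)                            # repeat the sampling 5000 times
]

mean_of_means = fmean(sample_means)   # should be close to mu
sd_of_means = pstdev(sample_means)    # should be close to sigma / sqrt(n)

print(round(se_theory, 4), round(mean_of_means, 2), round(sd_of_means, 2))
```

The simulated values will not match the theory exactly, because only a finite number of samples is drawn; increasing the number of repetitions should bring them closer.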
Remark:
For X ~ N(µ, σ²), the standard score is calculated as Z = (X − µ)/σ (Chapter 5)
For X̄ ~ N(µ, σ²/n), the standard score is calculated as Z = (X̄ − µ)/(σ/√n) (Chapter 6)
When the population variable is a quantitative variable (e.g. examination result), the sample is
usually summarized by the calculation of the sample mean. When the population variable is a
qualitative variable (e.g. gender of a student), the sample is then summarized by the calculation
of the sample proportion.
Example 1
Imagine that, of the same group of 2000 students taking the course “General Statistics”, 1500 are
male and 500 are female. The variable gender is a qualitative variable. Here, we use p to denote
the population proportion of males, so p = 0.75.
If a class of 30 students has 24 males and 6 females, we can use p̂ to denote the class proportion
of males (sample proportion), so that p̂ = 0.8.
Now imagine we select another class of 30 students. The proportion of males in this class may or
may not be the same as in the previous class. There are many classes of students, and each class
has its own class proportion of males. Again, we should consider the sample proportion as a
random variable.
Now, try to include all possible sample proportions and review the probability density function.
When p denotes the given population proportion, the characteristics of the density function of
the sample proportion p̂ can be summarized as follows:
1. E(p̂) = p
2. Var(p̂) = p(1 − p)/n
3. SE(p̂) = √(p(1 − p)/n)
When the sample size is reasonably large (n ≥ 30, np > 5, and n(1 − p) > 5), the distribution of
the sample proportion is well approximated by a normal distribution (Central Limit Theorem):
p̂ ~ N(p, p(1 − p)/n)
Example 1
For all year 1 students taking the course “General Statistics”, it is known that the population
proportion of males is p = 0.75.
When every 30 students are randomly assigned to a class, the distribution of the proportion of
males in a class is as follows:
Mean of class proportion of males = 0.75
Variance of class proportion of males = (0.75 × 0.25)/30 = 0.00625
Standard error of class proportion of males = √((0.75 × 0.25)/30) = 0.0791
As the sample size is reasonably large, p̂ ~ N(0.75, 0.0791²)
➢ On average, about 75% of each class is male, but the proportion is not fixed. The proportion
of males in a class has a standard deviation of 7.91% around the true level of 75%.
Example 4
Assume that 40% of all customers of a jewelry shop are classified as “high spending”. If random
samples of size 70 are selected, and each time the sample proportion of customers classified as
“high spending” is calculated and denoted as p̂, then
E(p̂) = 0.4
Var(p̂) = 0.4(0.6)/70 = 0.0034
SE(p̂) = √(0.4(0.6)/70) = 0.05855
As the sample size n = 70 is reasonably large, the sample proportion is approximately normally
distributed:
p̂ ~ N(0.4, 0.05855²)
Example 5
In Airline ABC, 20% of the customers book their air tickets for business trips. A promotion
focusing on this group of customers was recently launched. Consider many flights, each with 300
customers. With p denoting the population proportion of customers on business trips (p = 0.2)
and p̂ denoting the proportion of customers on business trips in a sample of 300 customers,
E(p̂) = 0.2
Var(p̂) = 0.2(0.8)/300 = 0.0005
SE(p̂) = √(0.2(0.8)/300) = 0.02309
As the sample size n = 300 is reasonably large, the sample proportion is approximately normally
distributed:
p̂ ~ N(0.2, 0.02309²)
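The three proportion examples follow the same pattern, which the sketch below wraps in one helper. The function name `proportion_sampling` and the labels are illustrative assumptions; the normality check uses the conditions n ≥ 30, np > 5 and n(1 − p) > 5 stated in this chapter.

```python
from math import sqrt

def proportion_sampling(p, n):
    """Return (mean, variance, standard error, normal_ok) of the sample proportion."""
    var = p * (1 - p) / n
    normal_ok = n >= 30 and n * p > 5 and n * (1 - p) > 5
    return p, var, sqrt(var), normal_ok

examples = {
    "class of 30 students": (0.75, 30),
    "jewelry shop, n = 70": (0.40, 70),
    "airline, n = 300": (0.20, 300),
}

results = {label: proportion_sampling(p, n) for label, (p, n) in examples.items()}
for label, (mean, var, se, ok) in results.items():
    print(label, mean, round(var, 5), round(se, 5), ok)
```

A case such as p = 0.9 with n = 30 would fail the check, because n(1 − p) = 3 is not greater than 5, so the normal approximation would not be recommended there.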
Proof that E(X̄) = µ:
X̄ = X1/n + X2/n + X3/n + ⋯ + Xn/n
E(X̄) = E(X1/n + X2/n + X3/n + ⋯ + Xn/n)
= E(X1/n) + E(X2/n) + … + E(Xn/n)
= (1/n)E(X1) + (1/n)E(X2) + … + (1/n)E(Xn)
= µ/n + µ/n + … + µ/n
= n(µ/n)
= µ
You may try to prove Var(X̄) = σ²/n in the same way (the second step uses the independence of
X1, X2, …, Xn):
Var(X̄) = Var(X1/n + X2/n + X3/n + ⋯ + Xn/n)
= Var(X1/n) + Var(X2/n) + … + Var(Xn/n)
= (1/n²)Var(X1) + (1/n²)Var(X2) + … + (1/n²)Var(Xn)
= σ²/n² + σ²/n² + … + σ²/n²
= n(σ²/n²)
= σ²/n
Chapter 7 Estimation
One major type of inferential statistics is estimating an unknown population parameter from
the information collected from a sample. In this chapter, we will discuss how to estimate the
unknown population mean and population proportion.
In the previous chapter, for a continuous random variable X with population mean µ and
population standard deviation σ, the sampling distribution of the sample mean for samples of
size n has the following characteristics:
(i) E(X̄) = µ
(ii) Var(X̄) = σ²/n
(iii) SE(X̄) = σ/√n
(iv) X̄ is normally distributed if either n ≥ 30 or X is itself normally distributed
Similarly, for any qualitative random variable X whose population proportion in favour of
one particular option is denoted as p, the sampling distribution of the sample proportion for
samples of size n has the following characteristics:
(i) E(p̂) = p
(ii) Var(p̂) = p(1 − p)/n
(iii) SE(p̂) = √(p(1 − p)/n)
(iv) p̂ is approximately normally distributed when n ≥ 30, np > 5, and n(1 − p) > 5
In this chapter, building on the above sampling distribution characteristics, we study the
technique of estimating the population mean by using the sample mean obtained from a
survey. We will also make use of the sampling distribution of the sample proportion to
estimate the population proportion.
Example 1
You are asked to review the lifetime of the light bulbs produced in a factory by reporting the
population mean lifetime. Lifetime is a continuous random variable. According to the
information provided by the factory, the population mean lifetime µ is unknown while the
population standard deviation σ is known to be 80 hours. How can we estimate the population
mean lifetime without doing a census, by conducting only a survey with sample size n = 50?
Example 1
In order to estimate the population mean lifetime, a random sample of 50 light bulbs is selected.
The sample mean lifetime is calculated as 680 hours.
In this case, the point estimate of the population mean lifetime is 680 hours.
The sample mean is a point estimate of the population mean. Definitely, a certain level of error
in the estimation is expected. The problem is: can the error in the estimation be calculated?
Error = x̄ − µ
How large is this sampling error? We cannot derive the sampling error for a particular sample
because the population mean is unknown (you must remember this point). However, we can derive
the maximum sampling error at a certain confidence level, e.g. the 95% confidence level (some
statisticians call this maximum sampling error the margin of error). To do so, we must be
familiar with the sampling distribution.
Example 1
As you remember, the lifetime of the light bulbs in a factory has the following
characteristics:
population mean µ, which is unknown,
population standard deviation σ = 80 hours
In order to estimate the population mean lifetime, a random sample of 50 light bulbs is selected.
If we do not focus on just one particular sample, but consider repeatedly selecting many
samples, each of size n = 50, then the distribution of the sample mean is:
X̄ ~ N(µ, 80²/50)
Proof:
P(−1.96 < Z < 1.96) = P(−1.96 < (X̄ − µ)/(80/√50) < 1.96)
= P(−1.96 × 80/√50 < X̄ − µ < 1.96 × 80/√50)
= P(µ − 1.96 × 80/√50 < X̄ < µ + 1.96 × 80/√50)
Since P(−1.96 < Z < 1.96) = 2(0.4750) = 0.95 from the standard normal table, 95% of the sample
means fall within 1.96 × 80/√50 of µ.
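[Figure: standard normal curve with an area of 0.025 in each tail beyond Z = −1.96 and Z = 1.96;
on the X̄ scale, the same region runs from µ − 1.96σ/√n to µ + 1.96σ/√n]
Sampling error = zα/2 × σ/√n
We call zα/2 the critical value, where α/2 is the upper tail area under the normal curve.
Commonly used confidence levels include:
(The critical values can be easily found from the standard normal table)
For the light bulbs, the sampling error at the 95% confidence level is 1.96 × 80/√50 = 22.1749
hours.
➢ There is a 95% chance that the difference between the calculated sample mean and the true
population mean is no more than 22.17 hours. There is only a 5% chance that this difference
is more than 22.17 hours.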
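Critical values and sampling errors can also be computed rather than looked up. The sketch below assumes `statistics.NormalDist.inv_cdf` in place of the printed table, and the function names are illustrative. Because the computed critical values carry more decimal places than table values such as 1.96, the resulting sampling error differs from the hand calculation in the later decimal places.

```python
from math import sqrt
from statistics import NormalDist

def critical_value(confidence):
    """z-score that leaves alpha/2 in the upper tail, where alpha = 1 - confidence."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def sampling_error(confidence, sigma, n):
    """Maximum sampling error: critical value times sigma / sqrt(n)."""
    return critical_value(confidence) * sigma / sqrt(n)

# Critical values for some commonly used confidence levels
for level in (0.90, 0.95, 0.99):
    print(level, round(critical_value(level), 3))

# Light bulb example: sigma = 80 hours, n = 50, 95% confidence level
error_95 = sampling_error(0.95, sigma=80, n=50)
print(round(error_95, 2))
```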
𝜎 𝜎
(𝑥̅ − 𝑧𝛼/2 × , 𝑥̅ + 𝑧𝛼/2 × )
√𝑛 √𝑛
Let’s take a look of how to construct the 95% confidence interval estimate. As the confidence
𝜎
level is set at 95%, the sampling error is calculated as 1.96 × 𝑛 . If repeated sampling is
√
conducted and each time an interval is calculated based on the formula
𝜎 𝜎
(𝑥̅ − 1.96 × 𝑛 , 𝑥̅ + 1.96 × 𝑛)
√ √
0.025
0.025
𝜎
1.96
√𝑛
sample 1:
sample 2:
sample 3:
sample 4:
sample 5:
sample 6:
sample 7:
sample 8:
sample 9:
sample
10:
……
95% Confidence Intervals
We can see from the above diagram that most of the constructed intervals can cover the true
unknown population mean but only a few do not. In fact, of all these constructed intervals, 95%
can cover the true unknown population mean.
Practically, if only one random sample is selected, there is 95% chance that the constructed
confidence interval can successfully include the unknown population mean.
Example 1
As the sample mean lifetime of 50 light bulbs is 680 hours and the 95% sampling error is
calculated as 22.1749 hours, the 95% confidence interval estimate of the population mean
lifetime is:
$$\left(680 - 1.96 \times \frac{80}{\sqrt{50}},\ 680 + 1.96 \times \frac{80}{\sqrt{50}}\right) = (657.8251,\ 702.1749) \text{ hours}$$
As a summary,
Example 2
The manager of a beauty counter wants to review the spending of the customers. The population
mean spending is unknown and the population standard deviation is $180. He estimates the
population mean spending by randomly selecting 60 customers. The sample mean spending of the
selected 60 customers is $880.
Solution
(a) The point estimate of the population mean spending is $880
(b) With σ = 180, n = 60,
the sampling error at 90% confidence level is
$$1.645 \times \frac{180}{\sqrt{60}} = \$38.2263$$
(c) The 90% confidence interval estimate of the population mean is:
$$\left(880 - 1.645 \times \frac{180}{\sqrt{60}},\ 880 + 1.645 \times \frac{180}{\sqrt{60}}\right) = \$(841.77,\ 918.23)$$
➢ The population mean spending is point estimated as $880 with a 90% sampling error of
$38.2263.
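The point estimate, sampling error and interval used in Examples 1 and 2 can be reproduced with a few lines of code. This is a sketch under the assumption that σ is known; the function name `z_interval` is illustrative, and the critical value is passed in as read from the normal table.

```python
from math import sqrt

def z_interval(xbar, sigma, n, z_crit):
    """Return (lower, upper, sampling_error) for a mean with known sigma."""
    margin = z_crit * sigma / sqrt(n)     # sampling error
    return xbar - margin, xbar + margin, margin

# Example 1: light bulbs, 95% confidence level
print(z_interval(680, 80, 50, 1.96))
# Example 2: customer spending, 90% confidence level
print(z_interval(880, 180, 60, 1.645))
```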
$$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \quad \text{follows a standard normal distribution}$$
$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \quad \text{follows a t-distribution with } n - 1 \text{ degrees of freedom}$$
where σ is the population standard deviation and s is the sample standard deviation, which is
the best estimator of the unknown population standard deviation.
The calculation of the t-value is almost the same as that of the standard score (z-value), but
the population standard deviation is replaced by the sample standard deviation. The sample
standard deviation is reasonably close to the population standard deviation, but it is itself a
variable that differs from sample to sample. As a result, the t-distribution looks similar to the
standard normal distribution but has “fatter” tails. The t-distribution is different for each
sample size; in fact, it gets closer to the standard normal distribution as the sample size increases.
[Figure: the standardized normal (Z) curve and a t curve, both centred at 0; the t curve has fatter tails.]
The t-distribution is very similar to the standard normal distribution but has relatively
fatter tails. As the degrees of freedom (defined as the sample size minus 1, df = n - 1)
increase, the t-distribution becomes more similar to the standard normal distribution. The
reason is that a larger sample size makes the sample standard deviation a more accurate
estimator of the population standard deviation. It is commonly accepted that when the degrees
of freedom are greater than 29, the t-distribution is well approximated by the standard
normal distribution.
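The "fatter tails" can be made concrete by computing the upper-tail area beyond 1.96 for several degrees of freedom and comparing it with the normal value of 0.025. The sketch below is an illustration only: it integrates the t density numerically with Simpson's rule, and the helper names `t_pdf` and `t_upper_tail` are not part of any library.

```python
from math import exp, lgamma, log, pi

def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    # log of the normalising constant; lgamma avoids overflow for large df
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + t * t / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # the density is symmetric about 0, so the area to the right of 0 is 0.5
    return 0.5 - total * h / 3

for df in (5, 19, 29, 1000):
    print(f"df = {df:4d}: P(T > 1.96) = {t_upper_tail(1.96, df):.4f}")
```

The tail area shrinks towards 0.025 as the degrees of freedom increase, which is the sense in which the t-distribution approaches the standard normal distribution.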
Let’s use the standard normal table and t-table to look up the middle 95% data:
The entries in Table II are values for which the area to their right under the t-distribution
with the given degrees of freedom (the gray area in the figure) is equal to α.
TABLE II VALUE OF t
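By using the t-distribution in place of the standard normal distribution, we can now estimate
the population mean with the 3-step procedure:
The point estimate of the population mean is $\bar{x}$.
The sampling error with 100(1 - α)% confidence level is $t_{\alpha/2} \times s/\sqrt{n}$.
The 100(1 - α)% confidence interval estimate of the population mean is $\left(\bar{x} - t_{\alpha/2} \times \frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2} \times \frac{s}{\sqrt{n}}\right)$,
where $t_{\alpha/2}$ is the critical value with $\alpha/2$ as the upper-tail area and n - 1 degrees of freedom.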
Remark:
The t-distribution is developed under the assumption that the random variable X follows a normal
distribution. In practice, we can also use the t-distribution to estimate the population mean when
the sample size is large enough (n > 30).
Example 3
In order to estimate the population mean age of the patients of a dentist, a random sample of 20
patients is selected. The sample mean age is 37.4 and the sample standard deviation is 7.8.
Assume that the ages of all patients follow a normal distribution.
Solution
With $\bar{x} = 37.4$, s = 7.8, n = 20, d.f. = 19, $t_{19,\,0.05} = 1.729$
(a) The point estimate of the population mean age is 37.4.
(b) The 90% sampling error is
$$1.729 \times \frac{7.8}{\sqrt{20}} = 3.0156$$
(c) The 90% C.I. of the population mean is
$$\left(37.4 - 1.729 \times \frac{7.8}{\sqrt{20}},\ 37.4 + 1.729 \times \frac{7.8}{\sqrt{20}}\right) = (34.3844,\ 40.4156)$$
➢ The population mean age of all patients is point estimated as 37.4 with the 90% sampling
error of 3.0156.
Besides the estimation of the population mean, another commonly estimated parameter is the
population proportion.
Very often, we are interested in knowing the proportion of people in favor of a particular option.
For example, what proportion of residents would support “Alex” to be the next president?
What proportion of people prefer the new green tea ice-cream flavor to the chocolate
ice-cream? What proportion of tourists would like to go to Japan for their next vacation?
Similar to the estimation of the population mean by sample mean, we are going to use the sample
proportion as a point estimate of the population proportion. Before that, we need to review the
sampling distribution of sample proportion as in Chapter 6.
Suppose we start with a population whose population proportion equals p. When random
samples of the same size are repeatedly drawn from this population, the sample proportion
can be viewed as a random variable, and the sampling distribution of the sample
proportion is:
$$\hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right)$$
for n > 30, np > 5, n(1 - p) > 5.
Here p denotes the unknown population proportion, while $\hat{p}$ is the sample proportion.
We are going to develop the 3-step estimation of the population proportion, using an approach
similar to the estimation of the population mean.
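[Figure: standard normal curve with area 0.025 in each tail beyond Z = -1.96 and Z = 1.96; on the $\hat{p}$ scale these points correspond to $p - 1.96\sqrt{p(1 - p)/n}$, $p$ and $p + 1.96\sqrt{p(1 - p)/n}$.]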
In general,
The unbiased point estimate of the population proportion is $\hat{p}$.
The sampling error with 100(1 - α)% confidence level is $z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$.
The 100(1 - α)% confidence interval estimate of the population proportion is
$$\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\ \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right)$$
Example 4
Before the election, an organization conducted a survey to investigate the support rate of
each candidate. The survey randomly interviewed 500 qualified voters; among them, 245
indicated that they would vote for Alex in the coming election.
(a) What is the point estimate of the population proportion of voters who support Alex?
(b) What is the sampling error at 95% confidence level?
(c) What is the 95% confidence interval estimate of the population proportion of voters who
support Alex?
Solution
Use p to denote the population proportion of voters who support Alex.
(a) Point estimate of p = 245/500 = 0.49
(b) 95% sampling error
$$= 1.96\sqrt{\frac{0.49(0.51)}{500}} = 0.04382$$
(c) 95% C.I. of p
$$= \left(0.49 - 1.96\sqrt{\frac{0.49(0.51)}{500}},\ 0.49 + 1.96\sqrt{\frac{0.49(0.51)}{500}}\right) = (0.4462,\ 0.5338)$$
➢ The population proportion of voters who vote for Alex is point estimated as 49% with the
95% sampling error of 4.38%.
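The three steps of Example 4 can be checked with a short calculation. This is a minimal sketch; the function name `proportion_interval` is illustrative, and the critical value 1.96 is taken from the normal table as in the example.

```python
from math import sqrt

def proportion_interval(successes, n, z_crit):
    """Return (p_hat, sampling_error, (lower, upper)) for a population proportion."""
    p_hat = successes / n                              # point estimate
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)    # sampling error
    return p_hat, margin, (p_hat - margin, p_hat + margin)

# Example 4: 245 of 500 voters support Alex, 95% confidence level
p_hat, margin, (lower, upper) = proportion_interval(245, 500, 1.96)
print(f"p_hat = {p_hat:.2f}, sampling error = {margin:.5f}")
print(f"95% C.I. = ({lower:.4f}, {upper:.4f})")
```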
Useful formulae
Why do we need hypothesis testing? Because we have an assumption about a population
parameter and we are not sure whether it is true. However, conducting a census is usually
practically impossible. In such a case, we can only rely on the information collected from a
survey to test whether the assumption is likely to be correct or likely to be wrong (taking
sampling error into account).
1. Identify the variable of interest and the null hypothesis of the test
2. If your hypothesis is correct, what do you expect to be observed from the sampled data?
3. Collect data through a survey
4. Does the collected data support your null hypothesis?
Example 1
You are asked to check whether the average sales per invoice this year is significantly different
from that of last year. Suppose that last year the sales amount followed a normal distribution with
population mean $5000 and population standard deviation $1100. It is reasonable to assume that
the sales amount this year follows a normal distribution with the same standard deviation as in
the previous year; however, it is uncertain whether the mean has changed significantly.
2. If the average amount of sales per invoice this year is $5000, when a survey is conducted, the
sample mean should be close to $5000 (with reasonable deviation due to sampling error).
3. Suppose a random sample of 30 invoices is collected and the sample mean is $5530.
4. Is the difference between $5530 and $5000 large enough to give strong evidence to reject
the null hypothesis μ = 5000? Or is the difference small enough that we do not have
evidence to reject the null hypothesis?
In order to do the test statistically, we need to be familiar with the concepts of sampling
distribution, sampling error and the normal distribution.
Now, let’s try to understand the following concepts and present our test in a statistical approach.
The first step in hypothesis testing is to state the hypotheses. There should be two
hypotheses.
Null Hypothesis, H0, is the statement that contains the assumption about the population (the equal
sign “=” is always included).
Alternative Hypothesis, H1, is the statement that we want to test against the null hypothesis (the
equal sign “=” should not be included).
Example
In our example, regarding the average amount of each sales invoice this year,
H0: μ = $5000 v.s. H1: μ ≠ $5000
Example
For a continuous random variable X with population mean μ and population standard deviation
σ, the distribution of the sample mean for samples of size n has the following
characteristics:
(i) $E(\bar{X}) = \mu$
(ii) $Var(\bar{X}) = \frac{\sigma^2}{n}$
(iii) $SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$
(iv) $\bar{X}$ is normally distributed if either n ≥ 30 or X itself is normally distributed
In Example 1, we have σ = $1100. If the null hypothesis µ = $5000 is correct, then when we select
a sample of size n = 30, we should expect the sample mean to be reasonably close to $5000.
When the observed sample mean is so far from $5000 that the probability of this happening is
very small, we have reason to reject the null hypothesis statistically. In the critical value
approach, we set up rejection region(s) and a non-rejection region to help determine whether
the null hypothesis should be rejected.
This is the graph for a two-tailed test. Later, we will discuss the one-tailed test.
Whether it is a two-sided test or a one-sided test, we need to make our decision “Can the
null hypothesis be rejected?” based on the summary statistics compiled from a sample. We
either:
                     H0 is true            H0 is false
Do not reject H0     (correct decision)    Type II error
                                           Probability = β
Reject H0            Type I error          (correct decision)
                     Probability = α
Type I error: The null hypothesis H0 is correct, but a very extreme sample is obtained
which indicates a violation of the null hypothesis, so the null hypothesis is
rejected.
Type II error: The null hypothesis H0 is wrong, but a sample very close to the null
hypothesis is obtained, so the null hypothesis is perceived as true and is
not rejected.
For example, if the test statistic follows the standard normal distribution, then for a two-tailed
test at the 5% significance level, the rejection region is z > 1.96 or z < -1.96.
The test statistic is a function of the data collected in the sample. Instead of directly comparing
the sample mean with the assumed value, it is more convenient to convert the normal distribution
to the standard normal distribution.
Example
As, in our case, we want to test whether the population mean μ = 5000, a z-score is calculated by
$$Z = \frac{\bar{X} - 5000}{1100/\sqrt{30}}$$
So when $\bar{X} \sim N\left(5000, \frac{1100^2}{30}\right)$ is true, $Z \sim N(0, 1^2)$ is true and we should expect the z-score
calculated from the sample to be close to 0.
➢ When the z-score calculated from the sample mean is reasonably close to 0, the null
hypothesis is not rejected and is considered to be correct. Otherwise, the null hypothesis
is rejected and is considered to be wrong.
➢ In a two-tailed test at the 5% significance level, the null hypothesis is rejected when the
z-score < -1.96 or the z-score > 1.96.
Combining all the concepts, we can present our test statistically in the following steps:
1. Define the null hypothesis and the alternative hypothesis (2-tailed or 1-tailed)
2. Define the rejection region(s) (based on the level of significance and whether it is a 2-tailed
or 1-tailed test)
3. Compute the test statistic
4. Make a conclusion: (i) H0 is rejected or (ii) H0 is not rejected
The above procedure applies not only to testing a population mean but also to other
situations. In this chapter, we will look at many types of hypothesis testing.
When the variable X is quantitative, the null hypothesis states that the population mean of X
equals a particular value, μ0. The sample mean obtained from a sample or experiment is
used to conduct the test.
[Figures: three standard normal curves (Z axis, centred at 0) illustrating the rejection regions.]
Step 3: Calculate the z-statistic
$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
Step 4: When the z-score falls into the rejection region, the null hypothesis is rejected;
otherwise the null hypothesis is not rejected.
Example 1
You are asked to check whether the average sales per invoice this year is significantly different
from that of last year. Suppose that last year the sales amount followed a normal distribution with
population mean $5000 and population standard deviation $1100. It is reasonable to assume that
the sales amount this year follows a normal distribution with the same standard deviation as in
the previous year. A sample of 30 invoices from this year gives a sample mean of $5530.
Test, at the 5% level of significance, whether the population mean this year is different from last year.
Solution
Denote X as the sales amount (quantitative variable). A z-test is conducted to test whether the
population mean is different from that of last year ($5000).
Step 2. Reject H0 when z < -1.96 or when z > 1.96 (z0.025 = 1.96)
Step 3.
$$z = \frac{5530 - 5000}{1100/\sqrt{30}} = 2.6390$$
Conclusion: There is sufficient evidence to conclude that the average sales per invoice this year
is different from last year.
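The critical value approach in this example can be sketched in code as follows. The function name `z_test_two_tailed` is illustrative, and the critical value is computed from the standard normal distribution instead of being read from the table.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(xbar, mu0, sigma, n, alpha):
    """Return (z, critical value, reject H0?) for H0: mu = mu0 v.s. H1: mu != mu0."""
    z = (xbar - mu0) / (sigma / sqrt(n))           # test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # upper-tail area alpha/2
    return z, z_crit, abs(z) > z_crit              # reject when z falls in either tail

z, z_crit, reject = z_test_two_tailed(5530, 5000, 1100, 30, 0.05)
print(f"z = {z:.4f}, critical value = {z_crit:.4f}, reject H0: {reject}")
```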
p-value approach
Another procedure for hypothesis testing is called the p-value approach. Assuming that
the null hypothesis is correct and the sampling distribution of the test statistic is known, the
p-value is the probability of obtaining a test statistic more extreme than the one observed.
If the p-value is smaller than the significance level, the null hypothesis is rejected; otherwise, if
the p-value is greater than or equal to the significance level, the null hypothesis is not rejected.
Redo the test in Example 1:
1. H0: μ = $5000 v.s. H1: μ ≠ $5000
2.
$$z = \frac{5530 - 5000}{1100/\sqrt{30}} = 2.6390$$
p-value = P(Z > 2.6390) + P(Z < -2.6390) = 0.0082
Note: For a two-tailed test, the p-value is 2 P(Z > |z|). If the p-value is small, the test
statistic would fall in the rejection region, as in the critical value approach.
1. Define the null hypothesis and the alternative hypothesis (2-tailed or 1 tailed)
2. Compile the test statistics
3. Find the p-value
4. Make conclusion: (i) H0 is rejected or (ii) H0 is not rejected
Finding p-value
The p-value of the test statistic indicates how extreme the observation is. For the three
sets of alternative hypotheses, there are three corresponding methods for finding the p-value.
Making conclusion
If the p-value is smaller than the significance level, the null hypothesis is rejected.
(The test statistic would fall in the rejection region, as in the critical value approach)
If the p-value is greater than or equal to the significance level, the null hypothesis is not rejected.
(The test statistic would fall in the non-rejection region, as in the critical value approach)
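The p-value for a z-statistic can be sketched as below. It assumes the usual correspondence between the alternative hypothesis and the tail area: 2 P(Z > |z|) for a two-tailed test, P(Z > z) for a right-tailed test and P(Z < z) for a left-tailed test. The function name and the labels "two-sided", "greater" and "less" are illustrative choices.

```python
from statistics import NormalDist

def z_p_value(z, alternative):
    """p-value of a z-statistic for the three forms of alternative hypothesis."""
    cdf = NormalDist().cdf
    if alternative == "two-sided":      # H1: parameter != hypothesised value
        return 2 * (1 - cdf(abs(z)))
    if alternative == "greater":        # H1: parameter > hypothesised value
        return 1 - cdf(z)
    if alternative == "less":           # H1: parameter < hypothesised value
        return cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

p = z_p_value(2.6390, "two-sided")
print(f"p-value = {p:.4f}, reject H0 at 5%: {p < 0.05}")
```

The computed value may differ from a table-based answer in the fourth decimal place, because the normal table requires z to be rounded to two decimal places before the lookup.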
In practice, when we test a hypothesis about the population mean, we may not know
the population standard deviation. In this case, the sample standard deviation replaces the
population standard deviation in the test statistic, and the t-distribution replaces the normal
distribution for setting up the rejection region.
Step 1: The null hypothesis and alternative hypothesis are set up similarly as in test I.
Step 2: Set up rejection region(s) from the t-distribution with degrees of freedom = n - 1 as:
Step 3: Calculate the t-statistic
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
Step 4: When the t-statistic falls into the rejection region, the null hypothesis is rejected;
otherwise the null hypothesis is not rejected.
Remark:
If the t-test is presented in the p-value approach, the p-value should be found from the t-distribution
with the corresponding degrees of freedom rather than from the normal table.
Example 2
The manufacturer claims that the volume of soft drink in a bottle follows a normal distribution
with mean 2 liters. A sample of 20 two-liter bottles is selected and found to have a sample
mean of 1.98 liters and a standard deviation of 0.18 liters. According to the survey result, is there
evidence at the 0.05 level of significance that the population mean amount of soft drink filled is
less than 2.0 liters?
Solution
Denote X as the volume of soft drink in a bottle (quantitative variable). The test is about whether
the population mean is less than 2.0 liters.
Step 2. With unknown population standard deviation and sample size n = 20, i.e. d.f. = 19,
Reject the null hypothesis when t < -1.729 (t19, 0.05 = 1.729)
Step 3.
$$t = \frac{1.98 - 2}{0.18/\sqrt{20}} = -0.4969$$
Conclusion: There is no evidence to say that the population mean amount of soft drink per bottle
is less than 2.0 liters.
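The left-tailed t-test in this example can be checked numerically. In this sketch the critical value 1.729 is taken from the t-table (d.f. = 19, upper-tail area 0.05), since Python's standard library has no t-distribution quantile function; the function name `one_sample_t` is illustrative.

```python
from math import sqrt

def one_sample_t(xbar, mu0, s, n):
    """t-statistic for testing a population mean with unknown sigma."""
    return (xbar - mu0) / (s / sqrt(n))

t = one_sample_t(1.98, 2.0, 0.18, 20)
t_crit = 1.729                  # from the t-table, d.f. = 19, upper-tail area 0.05
reject = t < -t_crit            # left-tailed test: reject only when t is too small
print(f"t = {t:.4f}, reject H0: {reject}")
```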
When the variable X is qualitative, the null hypothesis may state that the proportion of one
option equals a particular value. For example, you may want to test whether a coin is fair by
assuming that the proportion of heads obtained in the long run equals 0.5. The sample
proportion obtained from a survey or experiment is used to conduct the test.
Again, the alternative hypothesis may take one of three forms:
Step 3: Calculate the z-statistic as
$$Z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$$
As $\hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right)$, when the null hypothesis, H0: p = p0, is true, the above
z-statistic should follow the standard normal distribution and is likely to take a value
close to 0.
Step 4: When the z-score falls into the rejection region, the null hypothesis is rejected;
otherwise the null hypothesis is not rejected.
Example 3
A coin is suspected of being unfair. The coin is tossed 200 times and 120 heads are obtained.
Test, at the 0.10 level of significance, whether the coin is fair.
Solution
Let p be the population proportion of heads. When the coin is fair, p = 0.5.
Conclusion: As the hypothesis that the coin is fair is rejected, the coin is concluded to be unfair.
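The intermediate steps of the coin example can be worked out with the z-statistic for a proportion. The sketch below assumes a two-tailed test (H1: p ≠ 0.5), which matches the question of whether the coin is fair; the function name `proportion_z_test` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0, alpha):
    """Two-tailed z-test of H0: p = p0. Returns (z, critical value, p-value)."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)     # standard error uses p0, not p_hat
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, z_crit, p_value

z, z_crit, p_value = proportion_z_test(120, 200, 0.5, 0.10)
print(f"z = {z:.4f}, critical value = {z_crit:.4f}, p-value = {p_value:.4f}")
```

Here the z-statistic is well above the critical value, which agrees with the conclusion that the coin is unfair.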
IV. t-test: Hypothesis for the Difference between Two Means (Dependent)
Sometimes, two measurements are recorded from the same subject and a comparison
between the two measurements is required. For example, every student has to take two
assessments, the mid-term examination and the final examination; or the blood pressure of each
patient is recorded before and after taking a medicine. In this case, the difference D for each
pair of dependent measurements is computed, and the test of any significant difference between
the two populations has the null hypothesis H0: µD = 0, which converts the problem into a
one-population test. In the coming example, student performance in the mid-term examination
and the final examination will be compared in order to test the hypothesis that students perform
better in the final examination.
Step 2: Set up rejection region(s) from the t-distribution with degrees of freedom = n – 1 as:
Test                                 Rejection Region
a. H0: µD = 0 v.s. H1: µD ≠ 0        either t is too large (t > tα/2) or
                                     t is too small (t < -tα/2).
b. H0: µD = 0 v.s. H1: µD > 0        t is too large (t > tα).
c. H0: µD = 0 v.s. H1: µD < 0        t is too small (t < -tα).
Step 3: Calculate the t-statistic as
$$t = \frac{\bar{x}_D}{s_D/\sqrt{n}}$$
When the null hypothesis is true, the above t-statistic follows the t-distribution
with degrees of freedom n – 1.
Step 4: When the t-statistic falls into the rejection region, the null hypothesis is rejected;
otherwise the null hypothesis is not rejected.
Remark:
Define D clearly and keep it consistent for the whole test.
Example 4
The following are the results of a sample of 8 students from a school. Test at the 0.05 level of
significance whether students perform better in the end-of-term examination.
Mid-term examination       52  58  63  78  61  70  82  74
End-of-term examination    58  55  69  79  64  68  90  77
Solution
The test is about whether students perform better in the end-of-term examination.
Define D = End-of-term examination – Mid-term examination
d:                          6  -3   6   1   3  -2   8   3
From these differences, the sample mean is 2.75 and the sample standard deviation is 3.9188.
Step 3.
$$t = \frac{2.75}{3.9188/\sqrt{8}} = 1.9848$$
Step 4. As 1.9848 > 1.895, the null hypothesis is rejected at the 5% significance level.
Conclusion: There is sufficient evidence to conclude that students perform better in the end of
term examination.
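The paired calculation can be reproduced directly from the two rows of marks. This sketch uses the data of Example 4; the critical value 1.895 (d.f. = 7, upper-tail area 0.05) is the table value used in Step 4.

```python
from math import sqrt
from statistics import mean, stdev

mid_term = [52, 58, 63, 78, 61, 70, 82, 74]
end_term = [58, 55, 69, 79, 64, 68, 90, 77]

# D = end-of-term minus mid-term, kept in the same direction throughout
d = [e - m for e, m in zip(end_term, mid_term)]
d_bar = mean(d)                 # sample mean of the differences
s_d = stdev(d)                  # sample standard deviation (divisor n - 1)
t = d_bar / (s_d / sqrt(len(d)))

t_crit = 1.895                  # t-table value, d.f. = 7, upper-tail area 0.05
print(f"d = {d}")
print(f"mean = {d_bar:.4f}, sd = {s_d:.4f}, t = {t:.4f}, reject H0: {t > t_crit}")
```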
Suppose there are two independent populations (for example, male v.s. female, or new production
line v.s. old production line). We may need to test whether the population means (as the variable
is quantitative) of these two independent populations are the same. A typical example is to
compare the spending power of male customers and female customers. In such a case, independent
samples are selected separately from the two populations and the sample means are compared,
taking the possible sampling errors into account.
Step 1: The null hypothesis and alternative hypothesis are set up as:
Step 3: Calculate the z-statistic as
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
If the two independent samples are large enough, or if the two populations follow
independent normal distributions, then the two sample means follow independent
normal distributions,
$$\bar{X}_1 \sim N\left(\mu_1, \frac{\sigma_1^2}{n_1}\right) \quad \text{and} \quad \bar{X}_2 \sim N\left(\mu_2, \frac{\sigma_2^2}{n_2}\right)$$
When many possible pairwise comparisons between a sample from population 1 and a
sample from population 2 are made,
$$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$$
If the null hypothesis is true, the above z-statistic should follow the standard normal
distribution.
Step 4: When the z-score falls into the rejection region, the null hypothesis is rejected;
otherwise the null hypothesis is not rejected.
Example 5
The manager of a supermarket wants to find evidence to support the assumption that the average
spending of female customers is significantly higher than that of male customers. It is
assumed that the spending of female customers follows a normal distribution with
unknown mean and standard deviation $125. For the male customers, the spending is
assumed to follow a normal distribution with unknown mean and standard deviation $110.
A random sample of 23 female customers has a mean spending of $375 and another independent
sample of 25 male customers has a mean spending of $362. Is there evidence, at the 0.05 level of
significance, that the mean spending of female customers is higher than that of male customers?
Solution
Define X as the spending of each customer (quantitative variable). The test is about whether the
population mean for females (F) is higher than the population mean for males (M).
Conclusion: There is no evidence to say that on the average the female customers spend more
than male customers.
The calculation of the test statistic becomes complicated when it involves two samples. Below
is the report generated by Excel with the test conducted at the 0.05 level of significance. Let us
read it, generate a hypothesis testing report and derive the conclusion from it.
The first few lines of the report are straightforward and give a summary of the two datasets.
Be aware that the Excel output presents the results of both the one-tailed test and the two-tailed
test. You need to pick the appropriate set of results according to the alternative hypothesis in
order to generate the report.
As the alternative hypothesis in our example is one-tailed (right-tailed), the following report
can be derived from the Excel output:
Step 2. Reject H0 if z > 1.6449 (Critical value for one-tailed test, right tailed)
Conclusion: There is no evidence to say that on the average the female customers spend more
than male customers.
Remark:
$$z = \frac{375 - 362}{\sqrt{\frac{15625}{23} + \frac{12100}{25}}} = 0.3811$$
Step 4. As p-value = 0.3515 > 0.05, the null hypothesis is not rejected.
Conclusion: There is no evidence to say that on the average the female customers spend more
than male customers.
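The figures in the Excel report can be cross-checked without Excel. The following is a sketch of the two-sample z-test with known population standard deviations, applied to the supermarket example; the function name `two_sample_z` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1, x2, sigma1, sigma2, n1, n2):
    """z-statistic for H0: mu1 = mu2 with known population standard deviations."""
    return (x1 - x2) / sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# population 1 = female customers, population 2 = male customers
z = two_sample_z(375, 362, 125, 110, 23, 25)
p_value = 1 - NormalDist().cdf(z)      # right-tailed test: H1 is mu_F > mu_M
print(f"z = {z:.4f}, p-value = {p_value:.4f}, reject H0 at 5%: {p_value < 0.05}")
```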
We want to do a test similar to that in section (V). However, this time the population variances
of the two independent populations are unknown. In this case, we need to estimate the variances
by the sample variances. Instead of estimating the population variances separately, we
a. confirm / assume that the two independent populations are normally distributed, and
b. assume that the population variances are unknown but equal. Then the pooled variance is
estimated by
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}$$
This follows from the definitions of the two sample variances:
$$s_1^2 = \frac{\sum (x - \bar{x})^2}{n_1 - 1}, \qquad s_2^2 = \frac{\sum (y - \bar{y})^2}{n_2 - 1}$$
$$s_p^2 = \frac{\sum (x - \bar{x})^2 + \sum (y - \bar{y})^2}{(n_1 - 1) + (n_2 - 1)} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{(n_1 - 1) + (n_2 - 1)}$$
(Do you remember the calculation of a weighted average?)
Step 1: The null hypothesis and alternative hypothesis are set up as:
Step 2: Set up rejection region(s) from the t-distribution with degrees of freedom = n1 + n2 – 2 as:
Step 3: Calculate the t-statistic as
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$
The calculation is similar to that of the test statistic in the previous test, with the
population variances replaced by the pooled variance.
Step 4: When the t-statistic falls into the rejection region, the null hypothesis is rejected;
otherwise the null hypothesis is not rejected.
Example 6
The marketing team wants to know whether displaying a product in different positions in the
supermarket has a significant effect on sales performance. The sales performance
of Billy cola is used to conduct a test. The following are the weekly sales of Billy cola collected
from different supermarkets where the colas are displayed in (a) the normal shelf and (b) the
promotion area. Test at the 0.01 level of significance whether the average weekly sales of Billy
cola is higher when it is displayed in the promotion area than when it is displayed in the normal
shelf. Assume the sales for both display types follow normal distributions with equal variance.
Below is the Excel output with the test conducted at the 0.01 level of significance:
                                Normal shelf    Promotion area
Mean                            48.1            72
Variance                        167.6556        157.3333
Observations                    10              10
Pooled Variance                 162.4944
Hypothesized Mean Difference    0
df                              18
t Stat                          -4.1924
P(T<=t) one-tail                0.0003
t Critical one-tail             2.5524
P(T<=t) two-tail                0.0005
t Critical two-tail             2.8784
Step 2. Reject H0 when t < -2.552 (Critical value for one-tailed test, left-tailed)
Conclusion: There is sufficient evidence indicating that the average weekly sales of Billy cola
displayed at the promotion area is higher than that being displayed in the normal shelf.
Remark:
$$t = \frac{48.1 - 72}{\sqrt{162.4944\left(\frac{1}{10} + \frac{1}{10}\right)}} = -4.1924$$
Report in p-value approach
Define X as the weekly sales of Billy cola (quantitative). The test is about whether the population
mean weekly sales from supermarkets with items displayed in the promotion area (Pro) is higher
than the population mean weekly sales from supermarkets with items displayed in the normal shelf (N).
Step 4. As 0.0003 < 0.01, the null hypothesis is rejected at 1% level of significance.
Conclusion: There is sufficient evidence indicating that the average weekly sales of Billy cola
displayed at the promotion area is higher than that being displayed in the normal shelf.
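The pooled variance and t Stat lines of the Excel output can be reproduced from the summary statistics alone. This is a minimal sketch; the function name `pooled_t` is illustrative, and the critical value 2.552 is the one quoted in Step 2.

```python
from math import sqrt

def pooled_t(x1, x2, var1, var2, n1, n2):
    """Return (pooled variance, t-statistic, degrees of freedom)."""
    df = n1 + n2 - 2
    # weighted average of the two sample variances, weights n1 - 1 and n2 - 1
    sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    t = (x1 - x2) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return sp2, t, df

# sample 1 = normal shelf, sample 2 = promotion area
sp2, t, df = pooled_t(48.1, 72, 167.6556, 157.3333, 10, 10)
print(f"pooled variance = {sp2:.4f}, t = {t:.4f}, df = {df}")
print("reject H0 at 1%:", t < -2.552)   # left-tailed, critical value from the output
```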
Here, we would like to compare the population proportions of two independent populations.
In marketing studies, we often want to test whether males and females respond similarly to a
particular product. In the coming example, we need to conclude whether the proportion of
males who prefer Chinese tea to Japanese tea is similar to the proportion of females who prefer
Chinese tea to Japanese tea. When testing the null hypothesis that the two independent population
proportions are the same, H0: p1 = p2, we need to use the sample data to generate three
proportions: $\hat{p}_1$ as the sample proportion for population 1, $\hat{p}_2$ as the sample proportion for
population 2, and a pooled sample proportion obtained by combining all the data,
$$\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$
Step 1: Set up the null hypothesis and alternative hypothesis as follows:
The two independent sample proportions come from the following two normal
distributions:
$$\hat{p}_1 \sim N\left(p_1, \frac{p_1(1 - p_1)}{n_1}\right), \qquad \hat{p}_2 \sim N\left(p_2, \frac{p_2(1 - p_2)}{n_2}\right)$$
When many possible pairwise comparisons between a sample from population 1 and a
sample from population 2 are made,
$$\hat{p}_1 - \hat{p}_2 \sim N\left(p_1 - p_2, \frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}\right)$$
When the null hypothesis is true, p1 = p2 = p, and it becomes
$$\hat{p}_1 - \hat{p}_2 \sim N\left(0, \frac{p(1 - p)}{n_1} + \frac{p(1 - p)}{n_2}\right)$$
The above z-statistic should follow the standard normal distribution.
Step 4: When the z-score falls into the rejection region, the null hypothesis is rejected;
otherwise the null hypothesis is not rejected.
Example 7
A “tea-lover” group wants to know which kind of tea is preferred by males and females.
The marketing team has invited 200 males and 180 females to try two types of tea, Chinese tea and
Japanese tea. The result of the survey indicates that 70% of the males in the sample prefer Chinese
tea and 65% of the females in the sample prefer Chinese tea. Test, at the 0.05 level of significance,
whether the proportion of males who prefer Chinese tea and the proportion of females who prefer
Chinese tea in the population are significantly different.
Below is the Excel output with the test conducted at the 0.05 level of significance:
                                      Male      Female
Proportion                            0.7       0.65
Variance                              0.2189    0.2189
Observations                          200       180
Hypothesized Proportion Difference    0
z                                     1.0401
P(Z<=z) one-tail                      0.1491
z Critical one-tail                   1.6449
P(Z<=z) two-tail                      0.2983
z Critical two-tail                   1.9600
Step 2. Reject H0 when z < -1.96 or when z > 1.96 (Critical value for two-tailed test)
Step 4. As -1.96 < 1.0401 < 1.96, H0 is not rejected at the 5% significance level.
Conclusion: There is no evidence to say that the proportion of males who prefer Chinese tea and
the proportion of females who prefer Chinese tea in the population are different.
Remark:
$$z = \frac{0.7 - 0.65}{\sqrt{0.6763(0.3237)\left(\frac{1}{200} + \frac{1}{180}\right)}} = 1.0401$$
Conclusion: There is no evidence to say that the proportion of males who prefer Chinese tea and
the proportion of females who prefer Chinese tea in the population are different.
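The pooled proportion, the z-statistic and the two-tailed p-value of Example 7 can be verified with the sketch below. The function name `two_proportion_z` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1_hat, n1, p2_hat, n2):
    """Return (pooled proportion, z-statistic) for H0: p1 = p2."""
    pooled = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)
    z = (p1_hat - p2_hat) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return pooled, z

# population 1 = male, population 2 = female
pooled, z = two_proportion_z(0.70, 200, 0.65, 180)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-tailed test
print(f"pooled = {pooled:.4f}, z = {z:.4f}, p-value = {p_value:.4f}")
print("reject H0 at 5%:", p_value < 0.05)
```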
1. Check that the Add-in “Analysis ToolPak” has been installed before you start your
work.
2. Input data in the Excel worksheet.
3. From the menu bar, select Data > Data Analysis > t-Test: Two-Sample Assuming Equal
Variances.
4. Input Range for dataset 1, dataset 2, set null hypothesis value as 0, tick the box Labels if
you have included variable name in your dataset, and input alpha value which is the level of
significance of the test.
Useful formulae
In many situations, you need to examine differences of means of a quantitative variable among
many groups of individuals. For example, the policy maker may want to compare the traveling
expense for people living in different districts before suggesting the traveling expense allowance
scheme. By grouping residentials in different areas (Hong Kong Island, Kowloon, New
Territories), the objective is to test if the mean traveling expense among different groups are all
the same (null hypothesis) or the mean traveling expense among different groups are not all the
same (alternative hypothesis). This kind of test is named as one way analysis of variance
(ANOVA).
The one-way ANOVA test is used in particular for a quantitative variable measured in more than 2
populations (groups). It assumes that the c groups represent populations whose values are
randomly and independently selected, follow a normal distribution, and have equal variances.
The null hypothesis of no differences in the population means is tested against the alternative
hypothesis that not all the c population means are equal:
H0: μ1 = μ2 = … = μc
H1: not all μj are equal (where j = 1, 2, …, c)
Imagine the data (traveling expense to work on 3 September 2019) collected for the test about the
traveling expense are as below:
The F statistic based on the provided data should be calculated in order to judge whether there is
sufficient evidence to reject the null hypothesis. To calculate the F statistic, you should start by
preparing a summary of the data.
Referring to the above summary table, you should be aware that the sample means for group 1,
group 2, and group 3 are not the same. Allowing for possible sampling error, how do we judge
whether the differences between the means are small enough to be due to random error, or
whether they arise because the population means are not all the same?
You need to fill in the ANOVA summary table to calculate the F statistic, where
SST measures the total variation of the data about the overall mean,
SSA measures the variation of the group means about the overall mean, and
SSW measures the variation of the data about their own group means.
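In symbols, the three sums of squares and the F statistic can be written as below. The notation is an assumption made here for illustration: $x_{ij}$ is the i-th observation in group j, $n_j$ the size of group j, $\bar{x}_j$ the mean of group j, $\bar{\bar{x}}$ the overall mean, n the total number of observations, and c the number of groups.

```latex
\begin{aligned}
SST &= \sum_{j=1}^{c}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{\bar{x}}\right)^2 \\
SSA &= \sum_{j=1}^{c} n_j \left(\bar{x}_j - \bar{\bar{x}}\right)^2 \\
SSW &= \sum_{j=1}^{c}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2 \\
SST &= SSA + SSW \\
MSA &= \frac{SSA}{c-1}, \qquad MSW = \frac{SSW}{n-c}, \qquad F = \frac{MSA}{MSW}
\end{aligned}
```

The degrees of freedom c − 1 and n − c are the pair (2, 8) used for the traveling expense data, where c = 3 and n = 11.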
MSA measures the variation of the group means about the overall mean, while MSW measures the
variation of the data about their group means. If the null hypothesis is true, the calculated F
statistic should be close to 1. A calculated F statistic that is much larger than 1 is strong
evidence against the null hypothesis.
With the calculated F statistic of 3.9460, you need to compare this with the critical value found
from the F distribution with degrees of freedom (2, 8).
F Distribution
The next two tables show the critical values of the F distribution with degrees of freedom r1 and
r2 at the 5% and 1% levels of significance. When the resulting F statistic is greater than the
corresponding critical value, there is strong evidence to reject the null hypothesis that all means
are the same.
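The entries in Table III are values for which the area to their right under the F
distribution with given degrees of freedom (the gray area in the figure) is equal
to 0.05. Each row is labelled by the denominator degrees of freedom; the ten columns
correspond to numerator degrees of freedom 1 to 10.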
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99
120 3.92 3.07 2.68 2.45 2.29 2.18 2.09 2.02 1.96 1.91
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83
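The entries in Table IV are values for which the area to their right under the F
distribution with given degrees of freedom (the gray area in the figure) is equal
to 0.01. Each row is labelled by the denominator degrees of freedom; the ten columns
correspond to numerator degrees of freedom 1 to 10.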
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10
14 8.86 6.52 5.56 5.04 4.70 4.46 4.28 4.14 4.03 3.94
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.90 3.81
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69
17 8.40 6.11 5.19 4.67 4.34 4.10 3.93 3.79 3.68 3.59
18 8.29 6.01 5.09 4.58 4.25 4.02 3.84 3.71 3.60 3.51
19 8.19 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26
23 7.88 5.66 4.77 4.26 3.94 3.71 3.54 3.41 3.30 3.21
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17
25 7.77 5.57 4.68 4.18 3.86 3.63 3.46 3.32 3.22 3.13
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47
∞ 6.64 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32
To test whether the means of c populations are the same, a one-way ANOVA test should be
conducted as follows:
Step 4. When the F statistic falls into the rejection region, the null hypothesis is rejected;
otherwise the null hypothesis is not rejected.
Example
The Social Welfare Department is doing research on traveling expenses. Traveling
expenses to work on 3 September 2019 are collected for samples of individuals living in Hong
Kong Island, Kowloon, and the New Territories. Test, at the 5% significance level, whether the
mean traveling expenses of residents living in Hong Kong Island, Kowloon, and the New
Territories are the same.
Solution
Step 1: H0: µHK Island = µKowloon = µNT
H1: not all μ are equal
Step 3:
Group Number of data Mean
1 3 8
2 4 8.75
3 4 12.5
Total 11 9.9091
Conclusion: There is insufficient evidence to conclude that the mean traveling expenses of
residents living in Hong Kong Island, Kowloon, and the New Territories are not all the same.
When the data set gets large, calculating the F statistic becomes challenging. Below is the
output report from running the one-way ANOVA in Excel at the 0.05 level of significance:
SUMMARY
Groups Count Sum Average Variance
HK Island 3 24 8 9
Kowloon 4 35 8.75 2.916667
NT 4 50 12.5 5.666667
ANOVA
Source of Variation SS df MS F P-value F crit
Between Groups 43.15909 2 21.57955 3.945974 0.064217 4.45897
Within Groups 43.75 8 5.46875
Total 86.90909 10
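Referring to the above report, you can report the result of the test using the critical value
approach as:
Step 4: As F = 3.9460 < F crit = 4.4590, H0 is not rejected at 0.05 level of significance
Conclusion: There is insufficient evidence to conclude that the mean traveling expenses of
residents living in Hong Kong Island, Kowloon, and the New Territories are not all the same.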
You can report the result of the test using the p-value approach as:
Step 4: As p-value = 0.0642 > 0.05, H0 is not rejected at 0.05 level of significance
Conclusion: There is insufficient evidence to conclude that the mean traveling expenses of
residents living in Hong Kong Island, Kowloon, and the New Territories are not all the same.
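The ANOVA figures in the Excel report can be reproduced from the SUMMARY table alone. The sketch below uses only the counts, averages, and variances printed in that report; the variable names are illustrative, and the critical value is copied from the report rather than computed.

```python
# (count, average, variance) for each group, taken from the Excel SUMMARY table
groups = {
    "HK Island": (3, 8.0, 9.0),
    "Kowloon": (4, 8.75, 2.916667),
    "NT": (4, 12.5, 5.666667),
}

n = sum(count for count, _, _ in groups.values())   # total sample size
c = len(groups)                                     # number of groups
grand_mean = sum(count * mean for count, mean, _ in groups.values()) / n

# Between-group and within-group sums of squares
ssa = sum(count * (mean - grand_mean) ** 2 for count, mean, _ in groups.values())
ssw = sum((count - 1) * var for count, _, var in groups.values())

msa = ssa / (c - 1)
msw = ssw / (n - c)
f_stat = msa / msw

f_crit = 4.45897            # "F crit" from the Excel report, df = (2, 8)
reject_h0 = f_stat > f_crit

print(round(grand_mean, 4), round(ssa, 4), round(ssw, 4), round(f_stat, 4), reject_h0)
```

Working from group summaries rather than raw observations is possible because SSW equals the sum of (n_j − 1) times each group variance.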
Imagine a fair die is tossed 120 times. As the die is fair, we expect that among the 120 tosses,
each of the faces 1, 2, 3, 4, 5, 6 would be observed 20 times. In practice, it is not surprising if
some deviations from the expected frequencies are observed (for example, number 1 is observed
22 times instead of 20 times). Consider the following chi-square statistic, where O_i is the
observed frequency and E_i is the expected frequency of outcome i:
$\chi^2 = \sum_i \dfrac{(O_i - E_i)^2}{E_i}$
Number 1 2 3 4 5 6
Expected Frequency 20 20 20 20 20 20
Observed Frequency
A long series of computer simulations suggests that if the experiment is conducted repeatedly
(for example, 100000 times) and the corresponding chi-square statistic is computed each time,
the chi-square statistic approximately follows a chi-square distribution with k − 1 degrees of
freedom, where k is the number of categories of the variable. In our case, k = 6 so the degrees
of freedom are 5.
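A small simulation illustrates this claim. The sketch below is an assumption-laden illustration rather than the program used for the notes: it runs 2000 experiments instead of 100000 to keep the run time short, fixes the random seed, and compares the simulated statistics with two known properties of the chi-square distribution with 5 degrees of freedom (mean 5, and about 5% of values above 11.070).

```python
import random

def chi_square_stat(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def simulate_die_chi_square(n_tosses=120, n_experiments=2000, seed=0):
    """Toss a fair die n_tosses times, n_experiments times over."""
    rng = random.Random(seed)
    expected = [n_tosses / 6] * 6
    stats = []
    for _ in range(n_experiments):
        counts = [0] * 6
        for _ in range(n_tosses):
            counts[rng.randrange(6)] += 1   # each face equally likely
        stats.append(chi_square_stat(counts, expected))
    return stats

stats = simulate_die_chi_square()
mean_stat = sum(stats) / len(stats)
share_above = sum(s > 11.070 for s in stats) / len(stats)
print(round(mean_stat, 2), round(share_above, 3))
```

The average of the simulated statistics should be close to 5, and roughly 5% of them should exceed 11.070.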
Chi-Square Distribution
Treating the observed frequencies as the result of one of many possible samples, we can test
whether they fit a particular probability distribution by the following procedure:
Step 4. When the chi-square statistic falls into the rejection region, the null hypothesis is
rejected; otherwise the null hypothesis is not rejected.
The entries in Table V are values for which the area to their right under the chi-square distribution
with given degrees of freedom (the gray area in the figure) is equal to α.
TABLE V VALUES OF χ²
d.f. $\chi^2_{0.05}$ $\chi^2_{0.01}$ d.f.
1 3.841 6.635 1
2 5.991 9.210 2
3 7.815 11.345 3
4 9.488 13.277 4
5 11.070 15.086 5
6 12.592 16.812 6
7 14.067 18.475 7
8 15.507 20.090 8
9 16.919 21.666 9
10 18.307 23.209 10
11 19.675 24.725 11
12 21.026 26.217 12
13 22.362 27.688 13
14 23.685 29.141 14
15 24.996 30.578 15
16 26.296 32.000 16
17 27.587 33.409 17
18 28.869 34.805 18
19 30.144 36.191 19
20 31.410 37.566 20
21 32.671 38.932 21
22 33.924 40.289 22
23 35.172 41.638 23
24 36.415 42.980 24
25 37.652 44.314 25
26 38.885 45.642 26
27 40.113 46.963 27
28 41.337 48.278 28
29 42.557 49.588 29
30 43.773 50.892 30
Example
An ordinary die is thrown 120 times and each time the number on the uppermost face is noted.
The results are as follows:
Number 1 2 3 4 5 6 Total
Observed Frequency 14 16 24 22 24 20 120
Solution
Step 1. H0 : “1” : “2” : “3” : “4” : “5” : “6” = 1 : 1 : 1 : 1 : 1 : 1
H1 : “1” : “2” : “3” : “4” : “5” : “6” ≠ 1 : 1 : 1 : 1 : 1 : 1
Step 2. Reject H0 if χ² > 11.070 ($\chi^2_{0.05,\,5} = 11.070$)
Step 3. Outcome 1 2 3 4 5 6
Observed frequency 14 16 24 22 24 20
Expected frequency 20 20 20 20 20 20
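The chi-square statistic for this example follows directly from the observed and expected frequencies in Step 3. The sketch below computes it and applies the decision rule from Step 2; the variable names are illustrative.

```python
observed = [14, 16, 24, 22, 24, 20]
# Under H0 every face is equally likely, so each expected frequency is 120 / 6 = 20
expected = [sum(observed) / len(observed)] * len(observed)

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_value = 11.070      # chi-square critical value at 0.05 with 5 d.f. (Table V)
reject_h0 = chi_square > critical_value

print(round(chi_square, 4), reject_h0)
```

Here the statistic is 88/20 = 4.4, which is below 11.070, so H0 is not rejected at the 5% significance level.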
Example 1
When you want to rent a flat, do you think there is any relationship between the size of the flat
and the monthly rental cost? Is a bigger flat worth a higher monthly rental cost? Is it
possible to predict the monthly rental cost by knowing the size of the flat? To examine
the relationship between the size of the flat and the monthly rental cost, the information below
was collected from a property agency for a sample of 10 flats:
Apartment X: Size (square feet) Y: Monthly Rent ($)
1 700 8200
2 650 7500
3 690 7900
4 500 6700
5 820 10500
6 730 7900
7 740 7500
8 680 6800
9 540 6300
10 670 7000
Scatter Diagram
A scatter diagram is used to review the relationship between two variables by plotting a sample
of (x, y) data in an x-y plane. The relationship between two variables can take many
forms. The simplest is a straight-line relationship, which is called a linear
relationship.
[Figure: four scatter diagrams of Y against X]
Example 1
This is the scatter plot between the size of the flat (X) and the monthly rental cost (Y). Would
you say the correlation is positive or negative?
[Figure: scatter plot of monthly rental cost (Y, axis from 0 to 12000) against size of the flat (X, axis from 0 to 900)]
Coefficient of Correlation
While the scatter plot is very useful for us to visualize the relationship, the strength of the
relationship cannot be read out precisely. The coefficient of correlation, r, is a measure of the
strength of a linear relationship between two variables. The measurement r ranges from -1 to
1, where -1 indicates a perfect negative linear relationship and 1 indicates a perfect positive linear
relationship. The coefficient of correlation, r, is defined as
$r = \dfrac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - \left(\sum x\right)^2}\,\sqrt{n\sum y^2 - \left(\sum y\right)^2}}$
In general, when |r| < 0.3 the relationship is weak, when |r| is around 0.5 the relationship
is moderate, and |r| > 0.7 indicates a strong relationship.
Example 1
Consider the following information about the monthly rental cost (Y) and the size of the apartment
(X). Compute the correlation coefficient and comment on it.
Apartment x: Size (square feet) y: Monthly Rent ($) x² y² xy
1 700 8200 490000 67240000 5740000
2 650 7500 422500 56250000 4875000
3 690 7900 476100 62410000 5451000
4 500 6700 250000 44890000 3350000
5 820 10500 672400 110250000 8610000
6 730 7900 532900 62410000 5767000
7 740 7500 547600 56250000 5550000
8 680 6800 462400 46240000 4624000
9 540 6300 291600 39690000 3402000
10 670 7000 448900 49000000 4690000
Total 6720 76300 4594400 594630000 52059000
Solution
r = 0.7938 (from calculator)
Remark:
$r = \dfrac{10(52059000) - (6720)(76300)}{\sqrt{10(4594400) - 6720^2}\,\sqrt{10(594630000) - 76300^2}} = 0.7938$
➢ This indicates a strong positive correlation between the size of the apartment and the
monthly rental cost.
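The calculator result can be cross-checked with a few lines of code. This is a minimal sketch using the ten (size, rent) pairs from the table and the computational formula for r; the function name is illustrative.

```python
import math

sizes = [700, 650, 690, 500, 820, 730, 740, 680, 540, 670]            # x, square feet
rents = [8200, 7500, 7900, 6700, 10500, 7900, 7500, 6800, 6300, 7000]  # y, dollars

def pearson_r(x, y):
    """Coefficient of correlation from the sums n, Σx, Σy, Σx², Σy², Σxy."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

r = pearson_r(sizes, rents)
print(round(r, 4))
```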
The main tool in diagnosing whether a correlation is spurious or not is to examine the quality of
the theory behind it. In the case of tobacco and lung cancer, only a clear explanation for the
biological mechanism that caused smoking to lead to lung cancer settled the debate.
While the correlation measures the strength of a relationship, it does not indicate a cause-and-
effect relationship between the two variables. When one variable (dependent variable, Y) is
assumed to depend linearly on the other variable (independent variable, X), the simple linear
regression model can be used to describe the relationship between them.
It is reasonable to assume the monthly rental cost depends on the size of the flat, while it sounds
a bit strange to say that the size of the flat depends on the monthly rental cost. So the monthly
rental cost is the dependent variable (Y), whose value depends on the independent variable (X),
the size of the flat. Putting this in a linear model gives Y = a + bX. However, what values of a
and b are most suitable to explain the relationship for this set of (X, Y)?
[Figure: scatter plot of monthly rental cost (Y) against size of the flat (X) with a line Y = a + bX]
Define the random error as $\varepsilon_i = y_i - a - b x_i$. The idea of finding the best linear
regression line for the data is to find the pair of a and b for which the sum of squared errors
is the smallest. The best straight line (least squares regression line) that represents the points is
y = a + bx,
where $b = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$ and $a = \dfrac{\sum y}{n} - b\,\dfrac{\sum x}{n}$
Example 1
Assume a linear relationship is found between the monthly rental cost (Y) and the size of the
apartment (X). Fit the regression line Y = a + bX and interpret the values of a and b.
Apartment x: Size (square feet) y: Monthly Rent ($) x² xy
1 700 8200 490000 5740000
2 650 7500 422500 4875000
3 690 7900 476100 5451000
4 500 6700 250000 3350000
5 820 10500 672400 8610000
6 730 7900 532900 5767000
7 740 7500 547600 5550000
8 680 6800 462400 4624000
9 540 6300 291600 3402000
10 670 7000 448900 4690000
Total 6720 76300 4594400 52059000
[Figure: scatter plot of monthly rental cost (Y) against size of the flat (X)]
Solution
a = 911.7108, b = 9.9975 (from calculator)
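Remark:
$b = \dfrac{10(52059000) - (6720)(76300)}{10(4594400) - (6720)^2} = 9.9975$
$a = \dfrac{76300}{10} - 9.9975\left(\dfrac{6720}{10}\right) = 911.68$
(The hand calculation gives 911.68 because it uses b rounded to 4 decimal places; the calculator
value 911.7108 uses the unrounded b.)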
Remark
Interpreting the value of a alone does not make any sense (refer to the discussion of
extrapolation estimation below).
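The calculator values of a and b can be reproduced from the data with the least squares formulas. The sketch below is illustrative only; the function name is not from the notes.

```python
sizes = [700, 650, 690, 500, 820, 730, 740, 680, 540, 670]            # x, square feet
rents = [8200, 7500, 7900, 6700, 10500, 7900, 7500, 6800, 6300, 7000]  # y, dollars

def least_squares_line(x, y):
    """Return (a, b) of the least squares line y = a + b x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(u * v for u, v in zip(x, y))
    sum_x2 = sum(u * u for u in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * sum_x / n      # uses the unrounded slope
    return a, b

a, b = least_squares_line(sizes, rents)
print(round(a, 4), round(b, 4))
```

Because the intercept is computed from the unrounded slope, the result matches the calculator value 911.7108 rather than 911.68.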
One important application of the regression model is to predict / estimate the value of the
dependent variable y for a given x. The value of y for a given x is estimated as:
$\hat{y} = a + bx$
Whether the estimation is reliable or not depends on i) the value of x and ii) the correlation
between x and y.
When the value of x lies within the range of (Minimum X, Maximum X) in the dataset, this
estimation is called interpolation. For interpolation estimation with strong correlation, the
estimation is reliable; for interpolation estimation with moderate correlation, the reliability of the
estimation is questionable; for interpolation estimation with weak correlation, the estimation is
unreliable.
When the value of x is outside the range of (Minimum X, Maximum X) in the dataset, the
estimation is called extrapolation. The extrapolation estimation is always unreliable as we
cannot guarantee the linear relationship is still valid outside the range of the existing dataset. So
extrapolation estimation should be avoided.
Example 1
What are the estimated monthly rental costs for (a) an 800 square feet flat and (b) a 2000 square
feet flat? Comment on their reliability with reasons.
Solution
(a) 𝑦̂ = 911.7108 + 9.9975(800) = 8909.71 ($)
➢ When a flat is 800 square feet, the estimated monthly rental cost is $8909.71. The
estimation is reliable as it is an interpolation estimation with strong correlation.
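The estimate and the interpolation check can be expressed in a few lines. This sketch assumes the rounded coefficients a = 911.7108 and b = 9.9975 from the solution and takes the range of X (500 to 820 square feet) from the data table; the function names are illustrative.

```python
A, B = 911.7108, 9.9975      # fitted intercept and slope (rounded calculator values)
X_MIN, X_MAX = 500, 820      # smallest and largest flat size in the dataset

def predict_rent(size):
    """Estimated monthly rent for a flat of the given size in square feet."""
    return A + B * size

def estimation_type(size):
    """Interpolation if the size lies within the observed range of X, else extrapolation."""
    return "interpolation" if X_MIN <= size <= X_MAX else "extrapolation"

for size in (800, 2000):
    print(size, round(predict_rent(size), 2), estimation_type(size))
```

For 2000 square feet the formula still returns a number, but the size lies far outside the observed range, so the estimate is an extrapolation and should not be relied on.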
Rank Correlation
The rank correlation, rs, measures the relationship between two variables when the ranks,
instead of the actual values, of the two variables are used.
Example 2
A kid is invited to do a blind taste test of 6 ice-creams. After tasting all the ice-creams, he
arranges them in ascending order according to how much he likes them. Below is the
information about the price and the kid’s rating of the ice-creams.
Calculate the rank correlation between the price and the kid’s rating. Comment on it.
Solution
$r_s = 1 - \dfrac{6(26)}{6(6^2 - 1)} = 0.2571$
There is a weak positive correlation between the price and the kid’s rating of these 6 ice-
creams.
Remarks:
1. When there are no tied data, the rank correlation can be calculated with the calculator
(regression mode) by inputting the rank data.
2. When there are tied data, the same rank should be assigned to the tied data by taking the
average of their ranks, and a correction factor should be applied in the calculation of rs.
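The rank correlation formula used in the solution can be sketched as follows, for the case with no tied ranks. The value 26 is the sum of squared rank differences that appears in the solution; the function names are illustrative, and no correction for ties is applied.

```python
def rank_correlation_from_sum_d2(sum_d2, n):
    """r_s = 1 - 6 * sum(d^2) / (n (n^2 - 1)), valid when there are no tied ranks."""
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

def rank_correlation(rank_x, rank_y):
    """Rank correlation from two lists of ranks (no ties assumed)."""
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return rank_correlation_from_sum_d2(sum_d2, len(rank_x))

# Values from the solution: sum of squared rank differences 26, n = 6
rs_example = rank_correlation_from_sum_d2(26, 6)
print(round(rs_example, 4))
```

Identical rankings give rs = 1 and completely reversed rankings give rs = −1, which is a quick way to sanity-check the function.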
Data Set:
x 5.2 7.3 8.8 10.2 13.1 14.4 15.2 16.6 18.3 19.7 20.3 20.5
y 1.6 2.2 1.4 1.9 2.4 2.6 2.3 2.7 2.8 2.6 2.9 3.1
3. Input Data
5.2 , 1.6 DT
7.3 , 2.2 DT
: :
20.3 , 2.9 DT
20.5 , 3.1 DT
4. Essential statistics
n: SHIFT 1 3 EXE = 12
Σx: SHIFT 1 2 EXE = 169.6
Σx²: SHIFT 1 1 EXE = 2702.7
Σy²: SHIFT 1 1 EXE = 70.69
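The essential statistics read from the calculator can be verified against the data set with a short script. This is only a cross-check and is not part of the calculator procedure.

```python
x = [5.2, 7.3, 8.8, 10.2, 13.1, 14.4, 15.2, 16.6, 18.3, 19.7, 20.3, 20.5]
y = [1.6, 2.2, 1.4, 1.9, 2.4, 2.6, 2.3, 2.7, 2.8, 2.6, 2.9, 3.1]

n = len(x)
sum_x = sum(x)
sum_x2 = sum(v * v for v in x)   # sum of squared x values
sum_y2 = sum(v * v for v in y)   # sum of squared y values

print(n, round(sum_x, 1), round(sum_x2, 2), round(sum_y2, 2))
```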
The entries in Table I are the probabilities that a random variable having the
standard normal distribution will take on a value between 0 and z. They are given
by the area of the gray region under the curve in the figure.
z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4648 0.4656 0.4664 0.4671 0.4678 0.4685 0.4692 0.4699 0.4706
1.9 0.4713 0.4719 0.4725 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
Also, for z = 4.0, 5.0 and 6.0, the areas are 0.49997, 0.4999997, and 0.499999999.
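Entries of Table I can be reproduced with the error function in Python's standard library. The sketch below relies on the identity that the area under the standard normal curve between 0 and z equals 0.5 · erf(z/√2); the function name is illustrative.

```python
import math

def area_0_to_z(z):
    """Area under the standard normal curve between 0 and z."""
    return 0.5 * math.erf(z / math.sqrt(2))

for z in (1.00, 1.96, 4.0):
    print(z, round(area_0_to_z(z), 5))
```

For example, z = 1.96 gives 0.4750, matching the table entry in row 1.9, column 0.06.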
The entries in Table II are values for which the area to their right under the t
distribution with given degrees of freedom (the gray area in the figure) is equal
to α.
TABLE II VALUE OF t