
Quantitative Reasoning with Data

GEA1000 Teaching Team

August 11, 2024


Preface

GEA1000 Quantitative Reasoning with Data is a module that aims to equip students
with essential data literacy skills to analyse data and make decisions under uncertainty.
It covers the basic principles and practice for collecting data and extracting useful in-
sights, illustrated in a variety of application domains. For example, when two issues are
correlated (e.g. smoking and cancer), how can we tell whether the relationship is causal
(e.g. smoking causes cancer)? How can we analyse categorical data? What about numer-
ical data? What about uncertainty and complex relationships? These and many other
questions will be addressed using data software and computational tools, with real-world
data sets.
The framework that we will be making reference to frequently in this course is the
PPDAC cycle.[1] The figure below is a representation of the data problem-solving cycle,
“Problem, Plan, Data, Analysis and Conclusion.”

The PPDAC cycle is a well-established approach to statistical literacy which is relevant
to how we learn data literacy after the transformational change “big data” has had on
society.[2] The main features of PPDAC are

• (to) document the stages a person would undertake when solving a problem using
numerical evidence,

• using data which they had collected themselves, or from existing (public) data sets,

• (where) analysis methods can include machine learning algorithms, as well as more
traditional statistical techniques.

The following figure briefly describes what happens at each stage of the PPDAC cycle.[3]

[1] Spiegelhalter, David. (2019). The Art of Statistics. Penguin/Pelican Books.
[2] Wolff, A. et al. (2016). Creating an Understanding of Data Literacy for a Data-driven Society. The Journal of Community Informatics, 12(3), 9–26.
[3] Spiegelhalter, David. (2019). The Art of Statistics. Penguin/Pelican Books.
This set of notes is meant to follow the four chapters of the module closely. The topics
covered in the chapters are summarised below.

• Chapter 1: Getting data. Data collection and sampling. Experiments and observational
studies. Data cleaning and recoding. Interpreting summary statistics (mode,
mean, quartiles, standard deviation etc.)

• Chapter 2: Categorical data analysis. Bar plots, contingency tables, rates and basic
rules on rates. Association, confounders and Simpson’s Paradox.

• Chapter 3: Dealing with numerical data. Univariate and bivariate data. Histograms,
boxplots and scatter plots. Correlation, ecological and atomistic fallacies and simple
linear regression.

• Chapter 4: Statistical inference. Probability, conditional probability and independence.
Prosecutor’s fallacy, base rate fallacy and conjunction fallacy. Discrete and
continuous random variables. Interpreting confidence intervals. Hypothesis testing
and learning about a population based on a sample. Simple simulation.

Exploratory data analysis (EDA) will be incorporated extensively into the content of
the module. Students will appreciate that even simple plots and contingency tables can
give them valuable insights about data. There will be an emphasis on using suitable real-world
data sets as motivating examples to introduce content and, through the process of
problem solving, to elucidate the techniques and materials in the syllabus.
Contents

Chapter 1 Getting Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


Section 1.1 Exploratory Data Analysis . . . . . . . . . . . . . . . . . . . . . 1
Section 1.2 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Section 1.3 Variables and Summary Statistics . . . . . . . . . . . . . . . . . 8
Section 1.4 Summary Statistics - Mean . . . . . . . . . . . . . . . . . . . . . 10
Section 1.5 Summary Statistics - Variance and Standard Deviation . . . . . 13
Section 1.6 Summary Statistics - Median, quartiles, IQR and mode . . . . . 16
Section 1.7 Study Designs - Experimental Studies and Observational Studies 19
Exercise 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Chapter 2 Categorical Data Analysis . . . . . . . . . . . . . . . . . . . . . . . 35
Section 2.1 Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Section 2.2 Association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Section 2.3 Two rules on rates . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Section 2.4 Simpson’s Paradox . . . . . . . . . . . . . . . . . . . . . . . . . 50
Section 2.5 Confounders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Exercise 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Chapter 3 Dealing with Numerical Data . . . . . . . . . . . . . . . . . . . . . 71
Section 3.1 Univariate EDA . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Section 3.2 Bivariate EDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Section 3.3 Correlation coefficient . . . . . . . . . . . . . . . . . . . . . . . . 87
Section 3.4 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Exercise 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Chapter 4 Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Section 4.1 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Section 4.2 Conditional Probability and Independence . . . . . . . . . . . . 120
Section 4.3 Conjunction Fallacy, Base Rate Fallacy and Random Variables . 124
Section 4.4 Statistical Inference and Confidence Intervals . . . . . . . . . . . 129
Section 4.5 Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . 137
Exercise 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Chapter 1

Getting Data

Section 1.1 Exploratory Data Analysis

Discussion 1.1.1 Data exists in our everyday life. As we flip through our newspapers each day, we
see evidence of data being used and many questions being asked about data that has been collected. In
other words, we see that research is becoming data driven and it is fast becoming necessary for one to be
proficient in reasoning quantitatively. The ability to investigate and make sense of a data set is a core
21st century skill that any undergraduate, regardless of discipline, should acquire.
An online article in 2021 shows the following:

(Source: https://www.todayonline.com/singapore/fall-singapore-marriages-divorces-2020-amid-covid-19-restrictions-uncertainty)
After reading the article, it is natural for one to ask questions on how the conclusion was arrived at.
What kind of data was collected that supported this conclusion? Is the conclusion made correctly?

Definition 1.1.2 A population is the entire group (of individuals or objects) that we wish to know
something about.

Definition 1.1.3 A research question is usually one that seeks to investigate some characteristic of a
population.

Example 1.1.4 The following are some examples of research questions.

1. What is the average number of hours that students study each week?

2. Does the majority of students qualify for student loans?

3. Are student athletes more likely than non-athletes to do final year projects?

Broadly speaking, we can classify research questions into the following categories.
1. To make an estimate about the population.
2. To test a claim about the population.
3. To compare two sub-populations / to investigate a relationship between two variables in the pop-
ulation.

Example 1.1.5 Having a well designed research question is a critical beginning to any data driven
research problem. While an in-depth discussion on how research questions can be designed is beyond
the scope of this course, the following table gives a few examples and provides some insights into what
are some considerations and desirable features that good research questions should have.
1. Narrow vs. Less Narrow
   Neutral research question (Q1): Do Primary Six students have an average sleep time of 7 hours a day?
   Better research question (Q2): Do Primary Six students have an average sleep time of 7 hours a day? What are some variables that may play a part in affecting the number of hours they sleep?
   Explanation: Q1 is too narrow as it can be answered with a simple statistic. It does not look at any other context surrounding the issue. Q2 is less narrow and attempts to go beyond simply finding some data or numbers. It seeks to understand the bigger picture too.

2. Unfocussed vs. Focussed
   Neutral research question (Q1): What are the effects of eating more than 2 meals of fast food per week?
   Better research question (Q2): How does eating more than 2 meals of fast food per week affect the BMI (Body Mass Index) of children between 10 to 12 years old in Singapore?
   Explanation: Q1 is too broad, which makes it difficult to identify a research methodology. Q2 is focussed and clear on what data is to be collected and analysed.

3. Simple vs. Complex
   Neutral research question (Q1): How are schools in Singapore addressing the issue of mental health among school children?
   Better research question (Q2): What are the effects of intervention programs implemented at schools in Singapore on the mental health among school children aged 13 to 16?
   Explanation: Q1 is simple and such information can be obtained with a search online with no analysis required. Q2 is more complex and requires both investigation and evaluation, which may lead the research to form an argument.
We will now proceed to describe the process of Exploratory Data Analysis (EDA).

Definition 1.1.6 Exploratory Data Analysis (EDA) is a systematic process where we explore a data set
and its variables and come up with summary statistics as well as plots. EDA is usually done iteratively
until we find useful information that helps us answer the questions we have about the data set.
In general, the steps involved in EDA are

1. Generate research questions about the data.

2. Search for answers to the research questions using data visualisation tools. In the process of
exploration, we could also perform data modelling (e.g. regression analysis).

3. We ask ourselves the following question: To what extent does the data we have answer the questions
we are interested in?

4. We refine our existing questions or generate new questions about the data before going back to the
data for further exploration.
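
To make the iterative loop above concrete, here is a minimal sketch of one EDA pass in Python using the pandas library. The file name and column names are hypothetical and serve only to illustrate the kind of commands involved; they are not part of the module's materials.

    import pandas as pd

    # Load a hypothetical data set (file and column names are illustrative only).
    df = pd.read_csv("penguins.csv")

    # Steps 1-2: explore the variables with summary statistics and simple plots.
    print(df.describe())                    # mean, quartiles, etc. for numerical variables
    print(df["species"].value_counts())     # counts for a categorical variable
    df.hist(column="body_mass_g")           # distribution of a numerical variable
    df.plot.scatter(x="bill_length_mm", y="bill_depth_mm")   # a bivariate view

    # Steps 3-4: ask how well the data answers the question, refine, and repeat,
    # for example by looking at the same summaries within each species.
    print(df.groupby("species")["body_mass_g"].mean())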

Section 1.2 Sampling

Definition 1.2.1 A population of interest refers to a group in which we have interest in drawing con-
clusions on in a study.

Definition 1.2.2 A population parameter is a numerical fact about a population.

Example 1.2.3 The following are some examples of a population and an associated population param-
eter.

1. The average height (population parameter) of all primary six students in a particular primary school
(population).

2. The median number of modules taken (population parameter) by all first year undergraduates in a
University (population).

3. The standard deviation of the number of hours spent on mobile games (population parameter) by
pre-schoolers aged 4 to 6 in Singapore (population).

Definition 1.2.4 A census is an attempt to reach out to the entire population of interest.

While it is obviously nice to have a census, this is often not possible due to the high cost of conducting
a census. In addition, some studies are time sensitive and a census typically takes a long time to complete,
even when it is possible to do so. Furthermore, in a census attempt, one may not be able to achieve
100% response rate.

Definition 1.2.5

1. It is usually not feasible to gather information from every member of the population, so we look at
a sample, which is a portion of the population selected for the study.

2. Without the information from every member of the population, we will not be able to know exactly
what is the population parameter. The hope is that the sample will be able to give us a reasonably
good estimate of the population parameter. An estimate is an inference about the population
parameter based on the information obtained from a sample.

3. A sampling frame is the list from which the sample was obtained. Note that since a census does
not involve sampling, the notion of sampling frame is not applicable in that context.

Remark 1.2.6

1. Suppose the population of interest are people who drink coffee in Singapore. How should we
design a sampling frame for this population? The sampling frame may or may not cover the entire
population or it may contain units not in the population of interest. The all important question is
whether the sample obtained from such a sampling frame is still able to tell us something about
the population parameter. The following are some of the characteristics of the sampling frame that
we should pay attention to:

• Does the sampling frame include all available sampling units from the population?
• Does the sampling frame contain irrelevant or extraneous sampling units from another population?
• Does the sampling frame contain duplicated sampling units?
• Does the sampling frame contain sampling units in clusters?

2. One of the conditions of generalisability, which is the ability to generalise the findings from a sample
to the population is that the sampling frame must be equal to or greater than the population of
interest. Note that this does not mean that when our sampling frame covers the entire population
of interest, our findings from the sample will always be generalisable to the population. It is
still an important question to know how the sample was collected. (See Remark 1.2.17 for more
information on the criteria for generalisability.)

Definition 1.2.7 When we sample from a population, we must try to avoid introducing bias into
our sample. A biased sample will almost surely mean that our conclusion from the sample cannot be
generalised to the population of interest. There are two major kinds of biases.

1. Selection bias is associated with the researcher’s biased selection of units into the sample. This can
be caused by an imperfect sampling frame, which excludes units from being selected. Selection bias
can also be caused by non-probability sampling (see Definition 1.2.15 and Example 1.2.16).

2. Non-response bias is associated with the participants’ non-disclosure or non-participation in the
research study. This results in the exclusion of information from this group. There can be various
reasons for non-response, for example, inconvenience or unwillingness to disclose sensitive
information. Note that non-response bias may occur regardless of whether the sampling method is
probabilistic or non-probabilistic in nature.

Example 1.2.8

1. Suppose we would like to study the number of modules taken by all first year undergraduates in
a University. To collect a sample, the researcher went to two different lecture theatres to survey
undergraduates who were taking two different first year Engineering foundation (compulsory) mod-
ules. The sampling frame in this case consists of all undergraduates who were registered in the two
modules in the semester. Undergraduates who are not taking either of the two modules will not
have a chance to be sampled and thus the sampling frame is imperfect, leading to selection bias.

2. Suppose we would like to find out the proportion of students living at a boarding school who have
received some form of financial assistance in the past and if they had received financial assistance,
what was the quantum they received. A questionnaire was distributed to all students via a survey
form slipped under their room doors and instructions were given to them to complete the form
and drop it in a collection box if they had received financial assistance before. Students do not
need to return the form if they had not received any form of financial assistance previously. The
data collected from this is likely to be biased due to non-response as students who actually had
received financial assistance in the past may be reluctant to share this information or be seen by
their friends when they have to drop the form at the collection box. This will likely result in an
underestimate of the proportion of students who had received financial assistance.

Definition 1.2.9 Probability sampling is a sampling scheme such that the selection process is done via
a known randomised mechanism. It is important that every unit in the sampling frame has a known
non-zero probability of being selected but the probability of being selected does not have to be same
for all the units. The randomised mechanism is important as it introduces an element of chance in the
selection process so as to eliminate biases.

We will introduce four main types of probability sampling methods.

1. Simple random sampling (SRS) - this happens when units are selected randomly from the sampling
frame. More specifically, a simple random sample of size n consists of n units from the popula-
tion chosen in such a way that every set of n units has an equal chance to be the sample actually
selected. We are referring to sampling without replacement here, where a unit chosen in the sample
is removed and has no chance of being chosen again into the same sample. A useful way to perform
simple random sampling is to use a random number generator. While it is expected that differ-
ent samples sampled from the same sampling frame using SRS would be different, the variability
between the samples is entirely due to chance.

Example 1.2.10 The classic lucky draw that is carried out during dinners is the best example of
simple random sampling. In this case, every attendee has his/her lucky draw ticket placed inside a
box and a simple random sample of these tickets is drawn out of the box, one at a time, without
replacement. If we assume that before each draw, the remaining tickets in the box are mixed
properly such that every ticket is equally likely to be drawn out, then the probability
of each ticket being drawn at any instance is 1/n, where n is the number of tickets remaining inside
the box.

Example 1.2.11 Suppose we would like to sample 500 households in Singapore and find out how
many household members there are in each household. Let us assume that every household has a
unique home phone number. If we have a listing of all such phone numbers and list them from 1
to n, we can use a random number generator to select 500 phone numbers from the list to form
our sample. Unique phone calls (i.e. sampling without replacement) can then be made to these
households to survey the number of household members. This is another example of simple random
sampling. Notice that this example also illustrates a common shortcoming of SRS, in that it can
possibly be subjected to non-response from the units that are sampled.
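
A minimal sketch of how such a simple random sample could be drawn with Python's built-in random module is shown below; the sampling frame of phone numbers is entirely made up for illustration.

    import random

    # Hypothetical sampling frame: one (made-up) phone number per household.
    sampling_frame = ["6" + str(i).zfill(7) for i in range(1, 20001)]

    random.seed(1)   # fixed seed only so that the example is reproducible
    # random.sample draws without replacement, so every set of 500 numbers
    # is equally likely to be the sample that is selected.
    sample = random.sample(sampling_frame, k=500)
    print(len(sample), sample[:3])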

2. Systematic sampling is a method of selecting units from a list by applying a selection interval k
and a random starting point from the first interval. To carry out systematic sampling:

(a) Suppose we know how many sampling units there are in the population (denoted by p);
(b) We decide how big we want our sample to be (denoted by n). This means that we will select
one unit from every k = p/n units;
(c) From 1 to k = p/n, select a number at random, say r.

With this, the sample will consist of the following units from the list:

r, r + k, r + 2k, · · · , r + (n − 1)k.

However, it is often the case that we do not know the number of sampling units p in the population. In
such a situation, systematic sampling can still be done by deciding on the selection interval k,
randomly selecting a unit from the first k units, and then sampling every kth unit subsequently.
For example, if k = 10, we can sample the 5th, 15th, 25th units and so on.

Compared to simple random sampling, systematic sampling is a simpler sampling process as we do
not need to know how many sampling units there are exactly. On the other hand, if the listing is
not random, but instead contains some inherent grouping or ordering of the units, then it is possible
that a sample produced by systematic sampling may not be representative of the population.

Example 1.2.12 Suppose we know there are 110 sampling units in the population (so p = 110)
and we would like to select a sample with 10 units (so n = 10). Imagine the sampling units are
numbered 1 to 110 in a list and arranged according to the table below.

1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
101 102 103 104 105 106 107 108 109 110

Since p = 110 and n = 10, we select one unit from every k = 110/10 = 11 units. So we randomly
select a number from 1 to 11 which will start off the sampling process. For example, if the number
selected was 5, then our sample will comprise the elements

5, 16, 27, 38, 49, 60, 71, 82, 93, 104.

Similarly, if the number selected was 9, then our sample will comprise the elements

9, 20, 31, 42, 53, 64, 75, 86, 97, 108.

From this example, it should be clear that if the sampling units are listed with some inherent
pattern, then it is possible that the sample obtained could have selection bias.
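
The selection rule r, r + k, r + 2k, . . . can be written as a short function. The sketch below assumes, as in the example, that the number of units p is a multiple of the sample size n.

    import random

    def systematic_sample(units, n):
        """Systematic sampling: pick a random start r in the first interval,
        then take every k-th unit, where k = p / n (assumed to be a whole number)."""
        k = len(units) // n
        r = random.randint(1, k)                     # random starting point, 1-based
        return [units[r - 1 + i * k] for i in range(n)]

    units = list(range(1, 111))                      # the 110 units from Example 1.2.12
    print(systematic_sample(units, 10))              # e.g. [5, 16, 27, ..., 104] if r = 5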

3. Stratified sampling is a method where the sampling frame is divided into groups called strata.
Units within each stratum share similar characteristics, but the size of each stratum does not
necessarily have to be the same. We then apply simple random sampling to each stratum to generate
the overall sample. While stratified sampling is a commonly used probability sampling method,
there are some situations where it may not be possible to have information on the sampling frame
of each stratum in order to perform simple random sampling properly. Furthermore, depending on
how the strata are defined, we may face ambiguity in determining which stratum a particular unit
belongs to. This can complicate the sampling process.

Example 1.2.13 An example of stratified sampling can be seen during elections, for example,
a Presidential Election. Voters visit their designated polling stations to cast their votes for the
candidate that they wish to support. In countries where the number of voters is very large, it may
take a long time before all the votes are counted. Stratified sampling can be employed if we wish
to make a reasonably good prediction of the outcome. This is done by taking a simple random
sample of the voters at each polling station (stratum) and then computing the weighted average of
the overall vote count, based on the size of each stratum, for each candidate. This way, we would
be able to have a reasonably good estimate of the total votes each candidate would receive.
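
A rough sketch of the election estimate described above: within each polling station (stratum) we take a simple random sample of votes, then combine the stratum proportions using the stratum sizes as weights. The station names, sizes and votes are invented for illustration.

    import random

    random.seed(7)
    # Hypothetical polling stations: each holds a list of votes ("X" or "Y").
    stations = {
        "Station A": random.choices(["X", "Y"], weights=[55, 45], k=5000),
        "Station B": random.choices(["X", "Y"], weights=[48, 52], k=3000),
        "Station C": random.choices(["X", "Y"], weights=[60, 40], k=2000),
    }

    total_voters = sum(len(votes) for votes in stations.values())
    estimate_for_X = 0.0
    for votes in stations.values():
        srs = random.sample(votes, k=200)                      # SRS within the stratum
        p_hat = srs.count("X") / len(srs)                      # stratum proportion for X
        estimate_for_X += (len(votes) / total_voters) * p_hat  # weight by stratum size

    print(f"Estimated overall vote share for candidate X: {estimate_for_X:.3f}")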

4. Cluster sampling - is a method where the sampling frame is divided into clusters. A fixed number
of clusters are then selected using simple random sampling. All the units from the selected clusters
are then included in the overall sample. One advantage of this sampling method is that it is
usually simpler, less costly and not as resource intensive as other probability sampling methods.
The clusters are usually naturally defined which makes it easy to determine which cluster a unit
belongs to. The main disadvantage of this sampling method is that depending on which clusters
are selected, we may see high variability in the overall sample if there are largely dissimilar clusters
with distinct characteristics. In addition, if the number of clusters sampled is small, there is also
a risk that the clusters selected will not be representative of the population.

Example 1.2.14 Suppose a study wants to survey the mental wellness of Primary school students
in Singapore. Cluster sampling can be done by treating each Primary school as a cluster and this
way of clustering the population of interest is natural and unambiguous since all students in the
population belong to exactly one Primary school. A number of schools are then selected using
simple random sampling for this survey and all the students in the selected schools will be part
of the sample while those not in the selected schools will not be included. Another approach is of
course to apply simple random sampling with the list of all students (from all Primary schools) as
the sampling units. If this was done, then there is a possibility that all schools will have students
forming part of the sample. Cluster sampling would not provide such a characteristic.

We have presented four different probability sampling methods; below is a summary of the
advantages and disadvantages of each method.

Simple Random Sampling
  Advantages: Good representation of the population.
  Disadvantages: Time-consuming; accessibility of information and the sampling frame.

Systematic Sampling
  Advantages: Simple selection process as opposed to simple random sampling.
  Disadvantages: Potentially under-representing the population.

Stratified Sampling
  Advantages: Good representation of the sample by stratum.
  Disadvantages: Requires a sampling frame and criteria for classification of the population into strata.

Cluster Sampling
  Advantages: Less time-consuming and less costly.
  Disadvantages: Requires clusters to be reasonably heterogeneous and not have cluster-specific characteristics.

Remember:
There is no single universally best probability sampling method as each has its advan-
tages and disadvantages. All probability sampling methods can produce samples that
are representative of the population (that is, sample is unbiased). However, depending
on the situation, some methods would further reduce the variability, resulting in a more
precise sample.

Definition 1.2.15 A non-probability sampling method is one where the selection of units is not done by
randomisation. There is no element of chance in determining which units are selected; instead it is
usually down to human discretion.

Example 1.2.16

1. Convenience sampling is a non-probability sampling method where a researcher chooses subjects
to form a sample from among those that are most easily available to participate in the study. A common
occurrence of convenience sampling is at shopping malls, where surveyors approach shoppers at a
location convenient to them. Such a sampling method introduces selection bias since malls are
frequently visited by those who are more affluent. Other demographics of the population may
be left out. Another issue that may arise from convenience sampling done at shopping malls is
non-response bias, as shoppers may not want to be stopped for questionnaires, which they may feel is time
consuming and not what they are meant to be doing in a mall.

2. Volunteer sampling happens when subjects volunteer themselves into a sample. Such a sample is
also known as a self-selected sample and very often, the sample contains subjects who have a stronger
opinion (either positive or negative) on the research question than the rest of the population. Such
a sample is unlikely to be representative of the population of interest. For example, the host of a
“popular” radio talkshow may wish to find out how well received his show is. To do this, he asks
his listeners to go online and submit a rating of the show, out of a score of 10. Each listener can
voluntarily decide if they wish to be part of this rating exercise or not. By collecting a sample
of opinions this way, it is likely that the sample will be skewed towards a high rating because
listeners who did not like the talkshow would not even be aware of such a survey and therefore
their opinions would have been left out. On the other hand, listeners who are strong supporters of
the show would be more enthusiastic to go online to support their favourite radio show.

Let us summarise our discussion on sampling. In most instances where a census is not possible,
obtaining a sample of the population of interest is necessary. The following outlines the general approach
to sampling:

1. To design a sampling frame. Recall that a sampling frame should ideally contain the population
of interest so that every unit in the population has a chance to be sampled.

2. Decide on the most appropriate sampling method to generate a sample from the sampling frame.
Probability sampling methods are generally preferred over non-probability sampling methods as
non-probability sampling methods have a tendency to generate a biased sample.

3. Remove unwanted units (those that are not from the population) from the generated sample.

Remark 1.2.17 If the following generalisability criteria can be met, we will be more confident in
generalising the conclusion from the sample to the population.

1. Have a good sampling frame that is equal to or larger than the population;

2. Adopt a probability-based sampling method to minimise selection bias;

3. Have a large sample size to reduce the variability or random errors in the sample;

4. Minimise the non-response rate.

Section 1.3 Variables and Summary Statistics

Definition 1.3.1

1. A variable is an attribute that can be measured or labelled.

2. A data set is a collection of individuals and variables pertaining to the individuals. Individuals can
refer to either objects or people.

In a research question where we are examining relationships between variables, there is usually a
distinction between which are independent and which are dependent variables.

Definition 1.3.2

1. Independent variables are those that may be subjected to manipulation, either deliberately or
spontaneously, in a study.

2. Dependent variables are those that are hypothesised to change depending on how the independent
variable is manipulated in the study.

It is important to note that the dependent variable is hypothesised to change when the independent
variable is manipulated. It does not mean that the dependent variable must change. It is perfectly
possible that any changes to the independent variable does not result in any change in the dependent
variable.

Example 1.3.3

1. In a study, if we wish to investigate the relationship between time spent on computer gaming and
examination scores, the independent variable is the amount of time one spends on computer gaming
while the dependent variable is the examination score.

2. In a study where we investigate which brand of tissue paper is able to absorb the most water, the
independent variable is the brand of the tissue paper and the dependent variable is the amount of
water a piece of tissue paper (from a particular brand) can absorb. In this study, we will vary the
different brands of tissue paper used and record the different amounts of water absorbed.

3. We would like to study whether drinking at least 2 glasses of orange juice per day for a year is
associated[1] with having lower cholesterol levels in a year’s time. In this case, the independent variable
is whether (or not) a person drinks at least 2 glasses of orange juice a day. Each individual will
have an attribute labelled either as “YES” or “NO” with regards to this variable. The dependent
variable would be whether an individual’s cholesterol level next year is lower than this year’s level.
Again, each individual will have an attribute labelled either as “YES” or “NO” with regards to
this variable.

Definition 1.3.4

1. Categorical variables are those variables that take on categories or label values. The categories
or labels are mutually exclusive, meaning that an observation cannot be placed in two different
categories or given two different labels at the same time.

2. Numerical variables are those variables that take on numerical values and we are able to meaning-
fully perform arithmetic operations like adding and taking average.

3. Among categorical variables, there are generally two sub-types. An ordinal variable is a categorical
variable where there is some natural ordering and numbers can be used to represent the ordering.
A nominal variable is a categorical variable where there is no intrinsic ordering.

4. Among numerical variables, there are also generally two sub-types. A discrete numerical variable
is one where there are gaps in the set of possible numbers taken on by the variable.

5. A continuous numerical variable is one that can take on all possible numerical values in a given
range or interval.

Example 1.3.5

1. The happiness index used to measure how happy a group of Secondary school students are is an
ordinal variable. For instance, we can specify “1” as “not happy”, “2” as “somewhat not happy”,
“3” as “neutral”, “4” as “somewhat happy” and “5” as “happy”. Whether a subject drinks at least
2 glasses of orange juice or not is an example of a nominal variable.

2. The number of children in the school who scored an A grade in Mathematics for PSLE is a discrete
numerical variable. In this case, the gaps are the non integer values that lie between every two
integer values. It is clear that we cannot have, for example, 134.5 children scoring A in the school,
so there is a gap between 134 and 135.
[1] The notion of association between variables will be discussed extensively in Chapter 2.

3. The height or the weight of a person is a continuous numerical variable, as the weight can take on
all numerical values, not necessarily only the integer values.
A common way of presenting data is to use a table with rows and columns. Each row of the table
usually gives information pertaining to a particular individual while each column is a variable. So if we
look across a row in the table, we will see the variables’ information for that particular individual.

The table above shows part of a data set involving different species of penguins and some of the
physical attributes of the penguins. Each row represents a particular penguin and the columns are the
variables pertaining to that particular penguin. Some of the variables are categorical variables while
others are numerical. Can you figure out whether the categorical variables are ordinal or nominal? Can
you figure out whether the numerical variables are discrete or continuous?
With a data set, we are able to zoom into a particular individual’s information at a micro level. If we
do this, we can extract all the information on that particular individual for our use. However, we may
also be interested in looking at the entire data set at the macro level, obtaining information on groups of
individuals or the entire population. Useful information like trends and patterns can be observed from
the data through data visualisation, which is very useful. While calculations cannot be done through
visualisations, we can use summary statistics to do numerical and quantitative comparisons between
groups of data.
Summary statistics for numerical variables can be broadly classified into two types. Firstly, there
are those that measure the central tendencies of the data, like mean, median and mode. Secondly,
there are those that measure the level of dispersion (or spread) of the data, like standard deviation and
interquartile range.

Section 1.4 Summary Statistics - Mean

Definition 1.4.1 The mean is simply the average value of a numerical variable x. We denote the mean
of x by x̄ and the formula to compute x̄ is

x̄ = (x1 + x2 + · · · + xn)/n = (Σ xi)/n, where the sum runs over i = 1 to n.

Here, n is the number of data points and x1, x2, . . . , xn are the numerical values of the numerical variable
x in the data set.

Example 1.4.2 Suppose the bill length (in mm) of 7 penguins were

46.9, 36.5, 36.4, 34.5, 33.1, 38.6, 43.2.

Then the mean bill length is

(46.9 + 36.5 + 36.4 + 34.5 + 33.1 + 38.6 + 43.2)/7,

which is approximately 38.46 (rounded to 2 decimal places).
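
The same calculation can be done with Python's statistics module; the values below are the seven bill lengths from the example.

    from statistics import mean

    bill_lengths = [46.9, 36.5, 36.4, 34.5, 33.1, 38.6, 43.2]
    print(round(mean(bill_lengths), 2))   # 38.46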

Remark 1.4.3 These are some properties of the mean of a variable.

1. x1 + x2 + · · · + xn = n·x̄. This means that we may not know each of the individual values
x1, x2, . . . , xn, but we can calculate their sum if we know their mean (x̄) and the number of data
points (n) that is used to compute the mean.

2. Adding a constant value c to all the data points changes the mean by that constant value. So if
the mean of the values x1, x2, . . . , xn is x̄, then the mean of

x1 + c, x2 + c, . . . , xn + c

will be x̄ + c. For example, the mean of 1, 6, 8 is (1/3)(1 + 6 + 8) = 5 and the mean of (1 + 3), (6 + 3), (8 + 3)
(adding 3 to each of the 3 numbers 1, 6 and 8) is

((1 + 3) + (6 + 3) + (8 + 3))/3 = (4 + 9 + 11)/3 = 8 = 5 + 3.

3. Multiplying all the data points by a constant value c will result in the mean being changed by
the same factor of c. So if the mean of the values x1, x2, . . . , xn is x̄, then the mean of

cx1, cx2, . . . , cxn

will be c·x̄. For example, the mean of 2, 7, 12 is (1/3)(2 + 7 + 12) = 7 and the mean of (2 × 2), (2 × 7), (2 × 12)
(multiplying 2 to each of the 3 numbers 2, 7 and 12) is

((2 × 2) + (2 × 7) + (2 × 12))/3 = 42/3 = 14 = 2 × 7.

We will now look at several examples of means in real life.

Example 1.4.4 Consider a data set where we have daily weather data, collected at various weather
stations in Singapore. Part of the data set is shown below.

With this data set, some of the questions that we can ask are

1. Which month in 2020 had the most amount of rainfall?

2. If the mean monthly rainfall in 2020 was 157.22mm, what was the total amount of rainfall recorded
in 2020?

3. Is there any relationship between wind speed and temperature? What about between the amount
of rainfall and wind speed?

4. Does the weather pattern for 2020 allow us to make a good prediction for how the weather will be
like in 2021?

To answer the first question on the month with the most amount of rainfall, we need to add up the amount
of rainfall recorded on each day of a month, for every month in the year in order to do a comparison.
To answer the second question, using the information on the average rainfall (x̄ = 157.22), together with the fact
that

12·x̄ = x1 + x2 + · · · + x12,

we can find the total rainfall in 2020 to be 12 × 157.22 = 1886.64mm. This way, we can find the total
rainfall in 2020 without having to add the total amount of rainfall for each of the twelve months. It is
also useful to note that if the average rainfall in 2020 was 157.22mm, then
1. It is not possible for the amount of rainfall to be less than (or more than) 157.22mm every month
in 2020.
2. It is not necessarily the case that the amount of rainfall is 157.22mm every month in 2020.
3. In fact, it may not even be the case that there were six (half of twelve) months where the monthly
rainfall were higher than the mean and the other six months lower than the mean.
In conclusion, knowing the mean, while useful, does not tell us how the rainfall was distributed over the
twelve months of 2020. We would not know which months had more than the mean and which months
had less. In order to have further information beyond the mean, we need to know a bit more about the
spread of the data. This will be covered later in this chapter.

Example 1.4.5 Suppose students from two different schools (A and B) took a common examination
and the table below shows the average performance of the students in both schools.
No. of students Average mark
School A 349 32.21
School B 46 30.72
Overall 395 ?
The mean score of students in school A was 32.21 and the mean score of students in school B was
30.72. What would be the mean score of all the students in both schools if we consider them altogether?
Would it be the simple average

(32.21 + 30.72)/2 = 31.465?
The answer is no and the reason for this is because we do not know how many students in each school
contributed to the mean scores recorded in their respective schools. Imagine the extreme case where
school A had 500 students who took the examination while school B only had 5. In such a situation,
you would expect that the overall average score of the 505 students in both schools to be very close to
the mean score of school A. In order to know what is the overall mean for the students in both schools,
we need to have the information on the number of students in each school, given below.
Number of students
School A 349
School B 46
With this information, the overall mean can be computed using the weighted average of the two subgroup
means. The overall mean for the 349 + 46 = 395 students would be

(349/395) × 32.21 + (46/395) × 30.72 ≈ 32.04.

The numbers 349/395 and 46/395 that were multiplied to their respective group means are called the weights
of the subgroups. Observe that due to the much larger subgroup size of school A compared to that of
school B, the overall mean, as we expected, is much closer to the mean of school A.
Another useful observation is that the overall mean of 32.04 lies between the two subgroup means
of 32.21 and 30.72 (although closer to 32.21). This is not a one-off coincidence. Generally, the overall
mean will always be between the smallest and largest means among all the subgroups (when we have
more than just two subgroups). This will be discussed in greater detail in the next chapter.
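
A small sketch of the weighted-average calculation for the two schools, using the subgroup sizes as weights:

    sizes = {"School A": 349, "School B": 46}
    means = {"School A": 32.21, "School B": 30.72}

    total = sum(sizes.values())
    # Overall mean = sum over subgroups of (weight x subgroup mean).
    overall_mean = sum((sizes[s] / total) * means[s] for s in sizes)
    print(round(overall_mean, 2))   # 32.04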

Example 1.4.6 In this final example on means, we introduce a related concept known as proportions.
Suppose we would like to investigate the effectiveness of a new drug for treating asthma attacks compared
to existing drugs. The table below shows the number of patients taking the new drug and the number
taking the existing drug.

                         New drug    Existing drug
Number of patients         500           1000
Total asthma attacks       200           300

Since there are only 200 asthma attacks among those patients taking the new drug, compared to
300 asthma attacks among those taking the existing drug, can we conclude that the new drug is more
effective? The answer is no. Notice that the number of patients taking the new drug and those taking
the existing drug are vastly different. This means that we should not be simply looking at the absolute
number of asthma attacks observed in the two groups of patients, but instead consider the proportion of
patients in each group having asthma attacks. We see that the proportion is higher in the group taking
the new drug compared to the group taking the existing drug and this makes us a lot less confident that
the new drug is more effective than the existing one.

                         New drug         Existing drug
Number of patients         500                1000
Total asthma attacks       200                300
Proportion of patients
having asthma attacks    200/500 = 0.4     300/1000 = 0.3

The computation of proportion can actually be thought of as a mean in the following way. Imagine
that among the 500 patients receiving the new drug, we assign a numerical value of 1 to those who had
an asthma attack after taking the new drug and a numerical value of 0 to those patients who did not
have an asthma attack. If we do this, then the mean of these 500 observations of 0s and 1s would be

(1 + 1 + · · · + 1 + 0 + 0 + · · · + 0)/500 = 200/500 = 0.4,

where the sum contains 200 ones and 300 zeros. This coincides with what was computed as the proportion for this group of patients having asthma
attacks. Therefore, proportion can be thought of as a special case of mean.
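
The 0/1 coding described above can be checked directly: the mean of 200 ones and 300 zeros equals the proportion 0.4.

    from statistics import mean

    # New-drug group: 200 patients with an attack coded as 1, 300 without coded as 0.
    new_drug_outcomes = [1] * 200 + [0] * 300
    print(mean(new_drug_outcomes))   # 0.4, identical to the proportion of attacks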

Section 1.5 Summary Statistics - Variance and Standard Deviation

Definition 1.5.1 Recall that in Example 1.4.4, we saw that knowing the mean of a variable does not
tell us about how the data is distributed and the spread of the data points. Standard deviation is one of
the ways to measure the spread of the data about the mean. The computation of the standard deviation
is done via the computation of the sample variance of the data as follows:

Sample Variance, Var = [(x1 − x̄)² + (x2 − x̄)² + · · · + (xn − x̄)²] / (n − 1);

Standard Deviation, sx = √Var.

Here, x1, x2, · · · , xn are n observations of the variable x while x̄ is the mean.
You may wonder at this point why we need to compute the square of the difference between
each observation xi and the mean x̄ before proceeding to sum up these differences for all the n data
points. Why can’t we compute (xi − x̄) instead of (xi − x̄)²? Consider a set of 5 data points as follows:
x1 = 1, x2 = 3, x3 = 5, x4 = 7, x5 = 9. This would result in the mean being x̄ = (1/5)(1 + 3 + 5 + 7 + 9) = 5.
There is clearly a spread of the data points about the mean value of 5. However, if we were to consider

(x1 − x̄) + (x2 − x̄) + (x3 − x̄) + (x4 − x̄) + (x5 − x̄)
= (1 − 5) + (3 − 5) + (5 − 5) + (7 − 5) + (9 − 5) = 0,

this would result in the wrong conclusion that there is no variance (and thus no spread) of the data points
about the mean. The reason is simply because each data point could be smaller or bigger than the mean
and if the differences (xi − x̄) are not squared, they may cancel out each other like in the example above,
giving us the wrong impression that there is no variation or spread among the data points about the
mean.

Remark 1.5.2 You may wonder why, in the computation of sample variance, we divide the sum of the
squares (xi − x̄)² by n − 1 instead of n, since we have n data points and not n − 1. The reason is because
x1, x2, . . . , xn are assumed to be a sample taken from a population. We are using the variance observed
in such a sample to estimate the variance at the population level, which is usually unknown. You can
think of dividing by n − 1 instead of n as a ‘correction’ to make since our data is only a sample of the
population. More detailed discussion on this is beyond the scope of this module.

Example 1.5.3 The highest temperature recorded on the 1st day of every month is shown below:

Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
30.1  31.1  31.8  32.1  31.9  32.6  33.0  32.4  32.0  32.5  31.3  29.6

The mean is

(30.1 + 31.1 + 31.8 + 32.1 + 31.9 + 32.6 + 33.0 + 32.4 + 32.0 + 32.5 + 31.3 + 29.6)/12 = 31.7.

The sample variance is

Var = [(30.1 − 31.7)² + (31.1 − 31.7)² + · · · + (31.3 − 31.7)² + (29.6 − 31.7)²] / 11 ≈ 1.038.

The standard deviation is

sx = √Var ≈ 1.019.
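
These figures can be reproduced with Python's statistics module, whose variance and stdev functions use the same n − 1 divisor as the sample variance defined above.

    from statistics import variance, stdev

    temps = [30.1, 31.1, 31.8, 32.1, 31.9, 32.6, 33.0, 32.4, 32.0, 32.5, 31.3, 29.6]

    print(round(variance(temps), 3))   # sample variance (divides by n - 1), about 1.038
    print(round(stdev(temps), 3))      # sample standard deviation, about 1.019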

Remark 1.5.4 The following are some properties of the standard deviation of a variable x.

1. The standard deviation sx is always non-negative. In fact, sx is almost always positive and the
only instance when sx = 0 is when the data points are all identical, that is, x1 = x2 = · · · = xn .
In this case, the variance is zero and so is the standard deviation.

2. The standard deviation shares the same unit as the numerical variable x. For example, if x measures
the weight (in kilograms) of adult males in Singapore, then the unit for sx is also kilograms.

3. Adding a constant c to all data points does not change the standard deviation. So the standard
deviation for the set of data
A = {x1 , x2 , x3 , . . . , xn }
is the same as the standard deviation for the set of data

B = {x1 + c, x2 + c, x3 + c, . . . , xn + c}.

Intuitively, since all the data points are adjusted by the same constant c, the spread of the data
points about the new mean will be the same as the spread of the original data about the previous
mean.

4. Multiplying all the data points by a constant c results in the standard deviation being multiplied
by |c|, the absolute value of c. In other words, if sx is the standard deviation for the set of data
A = {x1 , x2 , x3 , . . . , xn },
then the standard deviation for the set of data
B = {cx1 , cx2 , cx3 , . . . , cxn }
will be |c|sx .

Example 1.5.5 Let us return to the data set involving three different species of penguins introduced
earlier in the chapter. The three species were named Chinstrap, Adelie and Gentoo and the data set
contained information on the physical attributes (e.g. mass, bill length, bill depth etc.) of various
penguins in each of the three species. An overarching question that one may be interested to answer is
- how different are these penguins? A common approach to answer this question is to compare those
physical attributes across samples collected for the different species and see if they are significantly
different. For example, we can compute the mean and standard deviation of the mass of the penguins,
summarised as follows:
Mean mass Standard deviation of mass
Chinstrap 3733g 384.3g
Adelie 3710g 458.6g
Gentoo 5076g 504.1g
Overall 4201g 802.0g
1. Observe that the overall mean mass 4201g is indeed between the group with the highest mean mass
(Gentoo) at 5076g and the group with the lowest mean mass (Adelie) at 3710g. This is consistent
with our earlier discussion.
2. Even though the overall mean mass is 4201g with standard deviation 802g, it does not imply that
the heaviest penguin weighs 4201 + 802 = 5003g.
3. Suppose we wish to investigate whether the Adelie and Chinstrap species are similar in terms of
their mass. First, we observe that the mean mass of these two groups are rather similar with the
Adelie species having a mean mass of 3710g while the Chinstrap species has a mean mass of 3733g.
However, the standard deviation of mass for these two species are rather different.
4. To examine further on the difference in physical attributes between the Adelie and the Chinstrap
species, we need to delve into other factors or variables that we have information on from the data
set, for example, variables like age, gender, location and so on. This is Exploratory Data Analysis
in action, where we start off with a few questions about the data set and with exploration into
the data, we ask new questions and go back to the data set to look more closely at the data in an
attempt to answer the new questions. In data analysis, this process is often repeated several times.
In relation to this penguin data set, here are some further questions that can be asked:
• Are male penguins heavier than female penguins?
• Is there a relationship between bill length and bill depth across all species?
• Do heavier penguins come from colder locations?
• Can findings in this data be generalised to all of the three species?
5. The concept of coefficient of variation is often used to quantify the degree of spread relative to the
mean. The formula is

coefficient of variation = sx / x̄.

Observe that since sx and x̄ have the same units, the coefficient of variation has no units and
is simply a number. The coefficient of variation is a useful statistic for comparing the degree of
variation across different variables within a data set, even if the means are drastically different
from one another.
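
A sketch of how such a per-species summary (mean, standard deviation and coefficient of variation of mass) might be produced with pandas; the file name and column names are assumptions made for illustration and will not exactly reproduce the figures in the table above.

    import pandas as pd

    df = pd.read_csv("penguins.csv")    # hypothetical penguin data set

    summary = df.groupby("species")["body_mass_g"].agg(["mean", "std"])
    summary["coef_of_variation"] = summary["std"] / summary["mean"]
    print(summary)

    # Overall mean and standard deviation, for comparison with the per-species values.
    print(df["body_mass_g"].mean(), df["body_mass_g"].std())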

Section 1.6 Summary Statistics - Median, quartiles, IQR and mode

Definition 1.6.1 In this section, we will introduce a few other summary statistics. We have already
discussed the mean, which measures the central tendencies of a variable, as well as standard deviation
which measures the spread of the data points about the mean. The median of a numerical variable in
a data set is the middle value of the variable after arranging the values of the data set in ascending or
descending order. If there are two middle values (when there are an even number of data points), we
will take the average of the two middle values as the median. The median is an alternative to the mean
as a measure of central tendencies of a numerical variable.

Example 1.6.2 After arranging the following 12 numbers


6, 12, 5, 10, 11, 18, 9, 4, 12, 11, 3, 13
in increasing order, we have
3, 4, 5, 6, 9, 10, 11, 11, 12, 12, 13, 18.
The median is the average of the sixth (10) and seventh (11) numbers in the order, which is 10.5.

Remark 1.6.3
1. We have seen that when a constant c is added to every data point in a data set, the mean will also
be increased by c. The median behaves in the same way, so if the median of the values x1 , x2 , . . . , xn
is r, then the median of
x1 + c, x2 + c, . . . , xn + c
is r + c.
2. We have also seen that when a constant c is multiplied to all the data points, then the mean is also
multiplied by c. The effect on the median is similar, so if the median of the values x1 , x2 , . . . , xn is
r, then the median of
cx1 , cx2 , . . . , cxn
is cr.

Example 1.6.4 Returning to Example 1.4.5, we saw that school B had 46 students who took an examination
and the mean of their scores was 30.72. The plot below, known as a dot plot, shows the scores
obtained by each of the 46 students.

Each dot placed on a particular number represents a student obtaining that score for the examination.
Since there were 46 students, the median score would be the average of the 23rd and 24th ranked students’
scores. The 23rd ranked student scored 30 marks while the 24th ranked student scored 31 marks. So the
median score is 30.5. This also means that 50% of the students scored below 30.5 marks and the other
50% scored more than 30.5 marks.
It is interesting to note that the mean score for school B was 30.72, which is very close to the median
score. The main reason for this is because the spread of the scores are quite symmetrical about the mean
and the median. Can you construct a data set where the mean and median are far apart?
We can also compute the median score for students in school A, as well as the overall median score
when we combine the students from both schools together. The median and mean (computed in Example
1.4.5) for each subgroup as well as the overall median and mean scores are shown in the table below.

                            Median score   Mean score
School A                        32            32.21
School B                        30.5          30.72
Schools A and B combined        32            32.04

Similar to what we observed for means, the overall median score (32) lies between the subgroup with
the highest median (32) and the subgroup with the lowest median (30.5). This is by no means a coincidence.
Even when there are more than 2 subgroups, the overall median will always be between the lowest median
and the highest median among all the subgroups. However, if we know each of the subgroup medians,
it is not possible to use this information to derive the overall median. This is unlike the case for mean
where, if we know the mean of each subgroup, together with the “weights” of each group (meaning the
number of members in each subgroup) we can take a weighted average to compute the overall mean
exactly.

Definition 1.6.5 We have seen that the median represents a numerical value where 50% of the data is
less than or equal to this value. This is also known as the 50th percentile of the data values. The first
quartile, denoted by Q1 , is the 25th percentile of the data values, while the third quartile, denoted by
Q3 is the 75th percentile of the data values. This means that 25% of the data is less than or equal to Q1
while 75% of the data is less than or equal to Q3 .

Definition 1.6.6 The interquartile range, denoted by IQR is the difference between the third and first
quartiles, so IQR = Q3 − Q1 .

Remark 1.6.7

1. IQR and standard deviation share similar properties. For example, we know that IQR is always
non negative since Q3 is always at least as large as Q1 and so Q3 − Q1 ≥ 0.

2. If we add a positive constant c to all the data points, not only does the median value increase by c,
Q1 and Q3 are increased by c as well. Thus, there will be no change in IQR. Of course, IQR also
remains unchanged if c is subtracted from all data points.

3. If we multiply all data points by a constant c, then IQR will be multiplied by |c|.

Example 1.6.8 Let us consider two simple data sets and compute the first quartile, median, third
quartile and interquartile range. The first data set consists of an even number of data points as follows:

16, 30, 5, 1, 9, 22, 19, 8, 10, 28.

We arrange these 10 data points in increasing order:

1, 5, 8, 9, 10, 16, 19, 22, 28, 30.

1. Since there are 10 data points, the median is the average of the 5th and 6th ranked data points, so
the median is (1/2)(10 + 16) = 13.

2. To find the first and third quartiles, we divide the data set into the lower half (1st to 5th ranked
data points) and upper half (6th to 10th ranked data points). The first quartile is the median of
the lower half
1, 5, 8, 9, 10,
which is the 3rd ranked data point in this lower half, so Q1 = 8. The third quartile is the median
of the upper half
16, 19, 22, 28, 30,
which is the 3rd ranked data point in this upper half, so Q3 = 22.

3. The interquartile range is Q3 − Q1 = 22 − 8 = 14.


Let us consider the second data set which consists of an odd number of data points as follows:
5.6, 1.5, 3.3, 8.7, −3.1, 9.2, 15.5, 2.6, 11.5.
We arrange these 9 data points in increasing order:
−3.1, 1.5, 2.6, 3.3, 5.6, 8.7, 9.2, 11.5, 15.5.
1. Since there are 9 data points, the median is the 5th ranked data point, so median is 5.6.
2. To find the first and third quartiles, we divide the data set into the lower half (1st to 4th ranked
data points) and the upper half (6th to 9th ranked data points). Note that we have not included
the median in both lower and upper halves. The first quartile is the median of the lower half
−3.1, 1.5, 2.6, 3.3,
which is the average of 1.5 and 2.6, so Q1 = 2.05. The third quartile is the median of the upper
half
8.7, 9.2, 11.5, 15.5,
which is the average of 9.2 and 11.5. So Q3 = 10.35.
3. The interquartile range is Q3 − Q1 = 10.35 − 2.05 = 8.3.

Remark 1.6.9
1. In the example above, when the data set has an odd number of data points, we have not included the
median in either the lower or the upper half. This is not a universal practice; you may encounter
some texts that include the median in both halves.
2. In reality, when the number of data points is large, summary statistics like the median and quartiles
are not computed manually but with software. However, different software packages do not all adopt
the same algorithm for computing these statistics (the sketch below illustrates this). The good news
is that we do not have to worry too much about finding the exact value of a quartile: for large data
sets, the different methods give very close answers and the small differences are not an issue. For
small data sets, it is also not really meaningful to summarise the data, since we have complete
information about the entire data set anyway.
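As promised above, here is a minimal Python sketch using the standard library's statistics module, which implements two common quartile conventions. Neither needs to coincide exactly with the manual "median of the halves" method used in Example 1.6.8, and for large data sets the differences are negligible.

import statistics

data = [1, 5, 8, 9, 10, 16, 19, 22, 28, 30]   # the first data set in Example 1.6.8

# statistics.quantiles returns the cut points [Q1, median, Q3] when n=4
print(statistics.quantiles(data, n=4, method="exclusive"))   # [7.25, 13.0, 23.5]
print(statistics.quantiles(data, n=4, method="inclusive"))   # [8.25, 13.0, 21.25]
# The manual method in Example 1.6.8 gives Q1 = 8 and Q3 = 22 instead.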

Remark 1.6.10 For a numerical variable, we can always use the mean and standard deviation as a pair
of summary statistics to describe the central tendency as well as the dispersion and spread of the data.
Similarly, the median and IQR can also be used. Which choice is more appropriate? There is no clear-cut
answer, but very often the choice depends on the distribution of the data. Generally speaking, the
median and IQR are preferred if the distribution of the data is not symmetrical or when there are outliers.
We will conclude this section with a final summary statistic that can be used for both numerical and
categorical variables.

Definition 1.6.11 The mode of a numerical variable is the numerical value that appears most often
in the data. For categorical data, a mode is the category that has the highest occurrence in the data.
The mode is generally interpreted as the peak of the distribution and this means that the mode has the
highest probability of being observed if a data point is to be selected randomly from the entire data set.

Example 1.6.12 In the following set of numbers,


11, 12, 5, 10, 11, 11, 9, 4, 12, 11, 3, 13,
the mode is 11, since it appears 4 times, more often than any other number.
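A minimal Python sketch (illustrative only) that finds the mode of this list using the standard library:

from collections import Counter

values = [11, 12, 5, 10, 11, 11, 9, 4, 12, 11, 3, 13]
counts = Counter(values)                   # tally how often each value occurs
mode, frequency = counts.most_common(1)[0]
print(mode, frequency)                     # 11 occurs 4 times, the highest count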

Section 1.7 Study Designs - Experimental Studies and Observational Studies
Recall that we introduced three types of research questions earlier in the chapter.

1. To make an estimate about the population.

2. To test a claim about the population.

3. To compare two sub-populations / to investigate a relationship between two variables in the pop-
ulation.

In this section, we will focus on the third type of question, where we investigate a relationship
between two variables in the population. For example, consider the question “does drinking coffee help
students pass the mathematics examination?” The two variables here are drinking coffee (yes or no)
and passing the mathematics examination (yes or no). Here, both variables are nominal categorical
variables. Commonly, a researcher looking at this situation may want to define “drinking coffee” as the
independent variable as it can be controlled and adjusted while “passing the mathematics examination”
is the dependent variable. In order to investigate this relationship, we need to design a study and for this
course, we will discuss two main study designs, namely experimental studies and observational studies.

Definition 1.7.1 In an experimental study (sometimes also known as controlled experiment or simply
an experiment), we intentionally manipulate one variable (the independent variable) to observe whether
it has an effect on another variable (the dependent variable). The primary goal of an experiment is to
provide evidence for a cause-and-effect relationship between two variables.

Example 1.7.2 Returning to the experiment to investigate the relationship between drinking coffee and
passing the mathematics examination, we can set up an experimental study by dividing the subjects,
that is, the students taking the examination, into two groups. The first group will be required to drink
exactly one cup of coffee every day for a month. The second group will not drink any coffee for one
month. The group who are required to drink one cup of coffee every day for a month is often known as
the treatment group since they are thought to be put through the “treatment” of drinking coffee. The
other group who does not drink coffee is known as the control group.
It is important to have a control group to compare against the treatment group. Without a control
group (imagine every subject is required to drink coffee for a month), we would not be able to determine
whether drinking coffee makes any difference at all. However, it should be noted at this point that
sometimes the control group is also subjected to some other form of treatment (not to be mistaken for
the treatment of interest in the study); that is, a control group does not necessarily mean no treatment at all.
One example is when we are comparing the effects of a new treatment with an existing treatment.
For such instances, the treatment group will be formed by subjects receiving the new treatment while
the control group will be those who continue to receive the existing treatment. Note also that if we were
to design a controlled experiment where the treatment group receives a new treatment, and the control
group receives the existing treatment, it is implicitly assumed, as part of the experimental design, that
we must know the effect of having no treatment. This ensures a meaningful comparison of the effects
between the new treatment and existing treatment with reference to the known baseline of having no
treatment.
A natural question now is how the subjects are to be divided into the two groups. Can we do
it any way we like? Can we let the odd-numbered subjects be in the treatment group and the even-
numbered subjects be in the control group? Does it matter? The problem of how to assign subjects to
the two groups is our next topic of discussion.

Discussion 1.7.3 Continuing on with the coffee drinking experiment, suppose one month after the
experiment started, the subjects from both groups took the mathematics examination and the number
of passes in each group is shown below.

Treatment group (coffee) Control group (no coffee)


Pass 900 450
Fail 100 550

We see that 90% (900 out of 1000) of the students in the treatment group passed the examination while
only 45% (450 out of 1000) of the students in the control group passed. There seems to be some evidence
that drinking coffee may help a student pass the mathematics examination. Is this evidence convincing?
Can we go one step further and say that coffee causes improvement in passing the examination?
The skeptics among us will probably not be so easily convinced. Possible doubts and questions that
could arise include the following:

1. Maybe the students in the “coffee group” just happen to be better in mathematics and thus have
a higher chance of passing the examination? Or maybe they just have higher IQ than those in the
“no-coffee” group?

2. Maybe many of the students in the “coffee group” had longer revision time before the examination
than those in the “no-coffee” group?

These are some of the possible factors that could have contributed to the difference in passing rates
between the two groups. In trying to establish a cause-and-effect relationship between two variables, we
want to make sure that the independent variable is the only factor that impacts the dependent variable.
In the coffee drinking example, we want to ensure that coffee drinking (or not) is the only variable
that distinguishes the treatment group from the control group. In other words, we need to ensure that
coffee drinking (or not) is the only difference between the subjects in the two groups. All other possible
differentiating factors, for example amount of revision time, should be removed.
How can these factors be “removed”? Surely we cannot mandate that all students in both groups
are only allowed a fixed number of revision hours before the examination! Even if we could, we most
definitely cannot enforce that all students in both groups must have the same IQ! The answer to this is
a powerful statistical method known as random assignment.

Definition 1.7.4 Random assignment is an impartial procedure that uses chance (or probability) to
allocate subjects into treatment and control groups.
How do we perform random assignment for our coffee drinking experiment? The following procedure
can be considered:

1. Write down the name of each student on a piece of paper.

2. Put all the pieces of paper into a box and mix them up.

3. Draw the names out one by one until exactly half the total number of students are chosen. The
names of the students chosen will form the treatment group.

4. The remainder of the students not in the treatment group will form the control group.

The procedure above is just one example of how random assignment can be done; other procedures are
possible as long as there is a random element. It should be noted that at every draw, each name remaining
in the box is equally likely to be chosen. Perhaps there are still doubters out there who feel that even
with such a chance-based assignment of subjects into the treatment and control groups, it may still
happen that many of the high-IQ students end up in the treatment group. However, we can be assured that:

If the number of subjects is large, by the law of probability, the subjects in the treatment
and control groups will tend to be similar in all aspects.
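The paper-in-a-box procedure can just as easily be carried out with software. The following is a minimal Python sketch of one possible way to perform random assignment; the student names are hypothetical placeholders.

import random

students = [f"Student {i}" for i in range(1, 21)]   # hypothetical list of 20 subjects

random.shuffle(students)              # impartial chance mechanism: shuffle the "box"
half = len(students) // 2
treatment_group = students[:half]     # the first half form the treatment group
control_group = students[half:]       # the remaining subjects form the control group

print(treatment_group)
print(control_group)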

Example 1.7.5 The S&P Index is a stock market index of the largest US publicly traded companies.
We are interested in the percentage returns of these S&P companies in 2013. Suppose these percentage
returns were written on 1000 tickets and we are aware that the percentage returns range from −4% to 4%.
Using the method of random assignment, the 1000 tickets are separated into two groups, each comprising
500 tickets. The following plots show the distributions of the percentage returns of companies in both
groups.

In each plot, the horizontal axis shows the percentage return and the vertical axis counts the number
of companies with that percentage return. We observe the effect of random assignment: the
distributions in the two groups are rather similar.
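We can mimic this example with a small simulation. The sketch below is illustrative only: the percentage returns are randomly generated rather than the actual 2013 figures, and they are split at random into two groups of 500, after which simple summaries of each group are compared.

import random
import statistics

random.seed(2013)
returns = [random.uniform(-4, 4) for _ in range(1000)]   # 1000 hypothetical returns between -4% and 4%

random.shuffle(returns)                        # random assignment of the tickets
group_1, group_2 = returns[:500], returns[500:]

for name, group in (("Group 1", group_1), ("Group 2", group_2)):
    print(name, round(statistics.mean(group), 2), round(statistics.stdev(group), 2))
# With 500 tickets per group, the two means and standard deviations come out very close.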

Remark 1.7.6

1. While performing random assignment to allocate subjects into treatment and control groups, it is
not always necessary (or even possible) for both groups to have exactly the same number of subjects;
for example, we may have 501 students to be divided into two groups. As long as some form of random
assignment is done and the number of subjects in each group is large enough, we can still be assured
that the two groups are similar in almost every aspect.

2. When we use the term “random” in random assignment, we do not mean that the assignment is
haphazard. The term random here refers to the use of an impartial chance mechanism to assign the
subjects into two (or more) groups.

Discussion 1.7.7 While random assignment is an important step to take when we divide our subjects
into the treatment and control groups, there is another important consideration when it comes to design-
ing a controlled experiment. If we make it known to the control group that they are indeed the control
group, and therefore not going to receive any form of treatment, this could possibly lead to bias.
To see why this is so, let us return to the coffee experiment. If the subjects in the control group are told
that they will not be assigned any coffee for a month, when we are testing if coffee helps a student pass
the mathematics examination, students in the control group may feel disadvantaged and therefore lack
confidence and motivation to study. This may in turn result in these students not doing well in the
examination and performing more poorly than their friends in the treatment group who were given coffee. Any
observed difference in passing rate between the two groups of students may not be the result of coffee
at all. If this happens, the effect of coffee may be overstated.
On the other hand, to the students in the control group, knowing that they will not be given coffee
may actually cause them to take certain measures for their own benefit of passing the examination.
For example, they may study harder and spend more time on their revision which may then result in
the control group performing better than the treatment group in passing the examination. Again, any
observed difference in passing rates between the two groups of students may not be the result of coffee
at all. If this happens, the effect of coffee may be understated.
One way to reduce this anxiety in the control group, which could otherwise influence the study of the
effects of coffee drinking, is to give the subjects in the control group another beverage that tastes and
smells the same as coffee but lacks the active ingredients in coffee that are believed to improve one’s
cognitive ability.

Definition 1.7.8 In the previous discussion, the alternative beverage is termed a placebo. A placebo
is an inactive substance or other intervention that looks the same as, and is given the same way as, an
active drug or treatment being tested. In the context of an experiment, a placebo is something given to
the control group that in actual fact, has no effect on the subjects in the group.
However, it has been observed that in some instances, subjects in the control group, upon receiving the
placebo, still showed some positive effects, likely caused by the psychology of believing that
they are actually being “treated”. This is known as the placebo effect.

Definition 1.7.9

1. One way to prevent the placebo effect from interfering with our experiment and observation on the
benefits (if any) of the treatment is to blind the subjects involved in the experiment. By blinding
the subjects, we mean that they do not know whether they belong to the treatment or control
group. To do this, a placebo that is “similar” to the treatment is given to the control group so
that the two treatments appear identical to the subjects. As a result, subjects do not know which
group they belong to. If we can do this, we would have achieved single blinding.

2. To take blinding one step further, other than blinding the subjects, it may be necessary to consider
blinding the researchers conducting the study as well, especially if measuring the effects of the
treatment may involve subjective assessments of the subjects. For example, in the coffee experi-
ment, if the assessors marking the students’ answers are aware of which group each student belongs
to, they may be inclined to award higher marks to students in the treatment group than those in
the control group. This is because the assessors may subconsciously believe that the treatment is
effective and this could introduce bias in the outcome.
Thus, we should also blind the assessors so that they do not know whether they are assessing the
treatment group or the control group. We would have achieved double blinding if both the subjects and
the assessors are blinded to the assignment.

To conclude this discussion on blinding, we should note that sometimes it may not be possible to
blind both the subjects and the assessors (can you think of one such experiment?) but when done right,
double blinding can be very effective in reducing bias in the outcome of the experiment.

Discussion 1.7.10 Besides an experimental study, another study design is an observational study.
Consider the following research question: Does vaccination help reduce the effects of the coronavirus?
If we were to design a controlled experiment, would the following be a possible and reasonable
approach?

ˆ Enrol a group of participants into the study and inject all the participants with low dosages of the
virus strain.

ˆ Perform random assignment to divide the group of subjects into the treatment group and control
group.

ˆ Inject the treatment group with the vaccine and inject a harmless liquid (similar in colour, smell,
etc. to the vaccine) into the control group, without revealing to the subjects what they are being injected with.

ˆ Observe the number of participants in each group who develop symptoms similar to a coronavirus
patient.

It is interesting to note that this is not a hypothetical situation. In fact, in 2020, during the COVID-
19 pandemic, a Dublin-based commercial clinical research organisation was reported to be planning an
experiment to test the effectiveness of a COVID-19 vaccine2 . The plan was similar to the approach
described above.
2 Ewen Callaway, 2020. “Dozens to be deliberately infected with coronavirus in UK ‘human challenge’ trials,” Nature, vol. 586(7831), pages 651–652, October.



You probably realise by now that it is not so straightforward to design a controlled experiment like
this. There are obvious ethical issues that need to be addressed. Some immediate questions that need
to be answered are

1. Should we inject such a virus into humans in the first place?

2. How should we decide who is to be assigned to the treatment group and who is to be assigned to
the control group?

3. Is it fair not to let the subjects know if they are injected with the vaccine or with a placebo? Should
we obtain consent from the subjects at the beginning of the study?

Experiments can give us useful evidence for a cause-and-effect relationship. However, not all research
questions are suitable to be investigated using an experiment, sometimes due to ethical issues like those
listed above. Therefore, we need to consider the pros, cons and feasibility of an experimental study
before deciding if we should proceed.

Definition 1.7.11 An observational study observes individuals and measures the variables of interest,
usually without any direct/deliberate manipulation of the variables by the researchers.

Remark 1.7.12 Observational studies are alternatives to experiments that can be used when we are
faced with ethical issues in experiments. An observational study observes individuals and measures
the variables of interest. As researchers usually do not attempt to directly manipulate or change one
variable to cause an effect in another variable, observational studies do not provide convincing evidence
of a cause-and-effect relationship between two variables.

Example 1.7.13 We would like to investigate whether exercising regularly (defined as exercising at
least 3 times a week, with at least 30 minutes of strenuous exercise each time) is associated3 with having a
healthy body mass index (BMI) (defined as between 18.5 and 22.9 kg/m2 ) for Singaporean men between
the ages of 30 and 40 years old.
Participants were recruited into the study and by their own declaration, they were classified into
either the “treatment” group (those who exercise regularly) or the “control group” (those who do not).
Participants were then told to proceed with their usual lifestyle habits and their body mass index was
measured after 3 months. The following table summarises the findings at the end of the study.

Treatment Control
(Exercise regularly) (Do not exercise regularly)
Healthy BMI range 320 127
Outside Healthy BMI range 101 191
3 As previously mentioned in Example 1.3.3, the topic of association between variables will be discussed in Chapter 2.

This is an example of an observational study. Do you think there is sufficient evidence of association
between exercising regularly and having a healthy BMI? We will discuss more questions like this in
subsequent chapters.
Let us conclude this chapter with some final remarks on study designs.

Remark 1.7.14

1. Not all research questions can be studied practically using an experiment. For example, if we would
like to investigate if long-term smoking is linked to heart disease, it is extremely difficult to design
an experiment that puts subjects into a treatment group where they are required to smoke over the
long term, even against their will; this would be both impractical and unethical. An observational
study may be more suitable for such an investigation.

2. For observational studies, there is no actual treatment being assigned to the subjects but we
normally still use the term treatment and control in the same way as though we are dealing with
an experiment. For the investigation on smoking and heart disease, smokers who are observed to
be smoking over a long period of time will be in the treatment group while non-smokers are in the
control group. Sometimes, we may use the term exposure group instead of treatment group and
non-exposure group instead of control group.

3. For experimental studies, subjects are assigned into either the treatment or control group by the
researcher. For observational studies, subjects assign themselves into either the treatment or con-
trol group.

4. Observational studies cannot provide evidence of cause-and-effect relationships. On the other hand,
experimental studies can provide such evidence if they have the features of randomised assignment and
blinding (preferably double blinding).

5. The question of generalisability is often asked. That is,

If an experiment is well-designed, can the conclusion of the experiment based on
a sample be generalised to the population from which the sample was drawn?

Having a good design is not the only important piece of the puzzle. In order to generalise the
results from a sample to a bigger population, there are other factors that are equally important,
for example, the sampling frame, sampling method, sample size and response rate.

Exercise 1
1. The following is a research question from a scientific journal.

What percentage of Singaporeans are keen to take vaccine X?

What type of research question is this?

(A) Make an estimate about the population.


(B) Test a claim about the population.
(C) Compare two sub-populations.
(D) All of the other options.

2. Drug X is a new drug created. It is intended to be taken as a tablet by people who have skin allergy
reactions. However, before pushing it into the market, researchers need to test the effectiveness of
drug X. Thus, they designed a study, with two groups - a treatment and a control group. Subjects
with skin allergy reactions were invited to the study and placed into either of the two groups. The
subjects in the treatment group received a drug X tablet to consume. Subjects were studied to
see if their allergic reactions were successfully alleviated and were marked as either ‘successful’ or
‘unsuccessful’.
What are the possible tablets to give to the control group subjects, for comparing drug X’s effec-
tiveness, assuming all tablets look and taste the same? Select all that apply.

(A) An empty tablet.


(B) A tablet containing glucose. It is a definite known fact that glucose has a 2% better success
rate than an empty tablet.
(C) A tablet containing salt, with an unknown success rate.

3. Which of the following scenarios is an example of random assignment in a controlled experiment?

(I) For each subject, Peter throws a fair die of six sides. Subjects are assigned based on the number
shown on the top surface of the die. Before the start of the assignment, Peter determines that
the numbers “1”, “2” and “6” will be assigned to the treatment group, and “3”, “4” and “5”
to the control group.
(II) James lists all subjects in the experiment by alphabetical order, and selects subjects whose
last name starts with “A”, “B”, “C” till “M” to place in the treatment group, while subjects
whose last name starts with “N”, “O”, “P” till “Z” are placed in the control group.

(A) Only (I).


(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

4. Which of the following statements is/are always true about controlled experiments and observa-
tional studies?

(I) There is no control group in observational studies.


(II) Randomised assignment of subjects does not occur in observational studies.
(III) There are no confounders in controlled experiments.

(A) Only (I) and (II).


(B) Only (I) and (III).

(C) Only (II) and (III).


(D) Only (II).

5. Siti conducted an investigation by randomly assigning each subject from a randomly selected sample
of 50 participants, either to watch Netflix for 1 hour 4 times a week, or to listen to Symphony 924FM
for 1 hour 4 times a week. After 6 months, the changes in the subjects’ blood pressure readings
over the same period were recorded. The changes were compared for the two groups. Which of the
following is true?

(A) This is a randomised experiment because blood pressure was measured at the beginning and
end of the study.
(B) This is an observational study because the two groups were compared at the end of the study.
(C) This is a randomised experiment because the participants were randomly assigned to either
activity.
(D) This is a randomised experiment because a random sample of participants was used.

6. A medical researcher assigned 80000 patients to receive either a new drug or an old drug randomly.
Among the 40123 patients who received the new drug, 24007 were male. Among the 80000 patients,
what is the most likely proportion of females?

(A) 20%.
(B) 30%.
(C) 40%.
(D) 50%.

7. May, an owner of a tuition center, wishes to find out if using iPads during tuition class improves
her students’ academic performance. She decided to conduct an experiment as follows:

1. She groups all the students in her center according to the day they come for tuition. For
simplicity’s sake, we can assume each student only goes for tuition once per week, there is at
least one class of tuition every day in her center, and no student drops out halfway.
2. Every student who goes for tuition on weekends will be given an iPad to use during class.
The students who go for tuition on weekdays will not be given an iPad.
3. She then keeps track of all her students’ academic performance for the next 6 months.

Which of the following statements is/are true?

(I) She used a probability sampling method.


(II) This is a controlled experiment without random assignment.

(A) Only (I).


(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

8. A researcher has invited 500 people to participate in his study. He uses random assignment to assign
the subjects into the Treatment and Control groups. The Treatment group has 200 subjects, and
the demographics of the 200 subjects are as follows:

Male Female
Old 83 18
Young 32 67

The 300 remaining subjects are in the Control group. The researcher should expect the number of
young males in the Control group to be around .

(A) 32.
(B) 48.
(C) 51.
(D) 173.

9. Adam is a supervisor of the Call-Centre department of his company. He is interested to know if


there is a relationship between having mid-day naps and the average number of calls completed
per individual among all 500 workers in his department. Of the 500 workers, 400 are females and
100 are males. He uses a randomised mechanism in assigning 250 of the workers to the treatment
group and 250 of the workers to the control group. Workers in the treatment group are given a
mid-day nap between 2p.m.–2.30p.m. every day, while those in the control group were not given
the mid-day nap. We assume that all in the treatment group took their daily nap of 30 minutes.
The number of calls cleared by the individual workers were recorded for a month, and it is noted
that there is a positive association between nap and daily average number of completed calls among
the 500 workers.
Which of the following must be true?

(A) The study’s findings are applicable to the target population of interest.
(B) There will be 50 males and 200 females randomly assigned to the treatment group, and 50
males and 200 females randomly assigned to the control group.
(C) The above is an example of a randomised, double-blind controlled experiment.
(D) None of the other options.

10. Suppose that in an experimental study on tea consumption and its association with blood pressure
level, we have 500 participants who were assigned into two groups – the treatment and control
groups. Participants were randomly given a number from 1-500, following which participants
numbered 1 - 250 were assigned to the treatment group whilst the rest were assigned to the control
group. Individuals in the treatment group were given freshly brewed Osmanthus Tea, while those
in the control group were given plain water. To alleviate any concerns about the nature of the
drink, the drinks were prepared in front of the subjects and the assessors, regardless of whether the
subjects were receiving tea or plain water. Which of the following best describes the study above?

(A) The above is a non-randomised, non-blinded controlled experiment.


(B) The above is a randomised, non-blinded controlled experiment.
(C) The above is a non-randomised, single-blinded controlled experiment.
(D) The above is a randomised, single-blinded controlled experiment.

11. From the options given, select all possible words that can be used to complete the sentence below.

Probability sampling refers to a sampling process whereby the probability of selection of individuals
within the sampling frame must be .

(A) non-zero
(B) known
(C) high

12. Paracetamol company NAS owns a tablet press machine that produces Paracetamol tablets. On
one shift, 3000 batches of tablets were manufactured. Each batch contains 10 tablets - a total of
30,000 tablets were manufactured. A researcher wants to ensure the dosage in the tablets is correct
but has no time to check every single tablet. Hence, she decides to sample some of the tablets
instead.
Which of the following describes a probability sampling method? Select all that apply.

(A) Select 3000 tablets at random.


(B) Label all the tablets in each batch from 1 to 10, select a number from 1 to 10 at random, and
select the unit from every batch that corresponds to that number.
(C) Select 300 batches at random, and then sample all tablets in every selected batch.
(D) Select the first 3000 tablets that were manufactured.

13. A recent study revealed that Singapore is “the most tired country in the world, due to work and
internet.” A researcher decided to conduct a further study on internet usage behaviour and working
hours among all Singaporean adults in Singapore. Data was collected by interviewing commuters
alighting from Pasir Ris MRT (East), Woodlands MRT (North), Redhill MRT (South) and Jurong
East MRT (West) from 8am to 11pm over a period of 7 days.
Which of the following statements is necessarily true?

(A) As data was collected from different parts of Singapore, it is generalisable to the population
of Singapore.
(B) Due to the equal representation of Northern, Southern, Eastern and Western parts of Singa-
pore, selection bias is minimised.
(C) In this example, non-response bias exists because of a bad sampling plan.
(D) None of the other options.

14. The United States government conducts a Census of Agriculture every five years. The census
comprises farmland usage in all the 50 states in the country. John generated a sample of 3000
counties across all states from this census. He then collected data on the number of acres of land
space these counties in the sample devoted to farms, and summarised his findings in a report as
follows:

(I) Of the 3000 counties selected, 25 counties were selected more than once in the sampling
process.
(II) 18% of the counties selected in this sample were from the state of Virginia, while none were
from the states of Alaska, Arizona, Connecticut, Delaware, Hawaii, Rhode Island, Utah or
Wyoming.

John claimed that he obtained the sample of 3000 counties by Stratified Sampling with replace-
ment, with the stratum being every state in the United States. Assuming that statements (I)
and (II) are true, which of the statements do not/does not support John’s claim on his sampling
method?

(A) Only (I).


(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

15. For the following two cases, determine which sampling plan was used.
Case 1: In an opinion poll, an airline company made a list of all its flights on 1 Jan 2022 and then
selected a simple random sample of 30 flights. All the passengers on those flights selected were
asked to fill out a questionnaire form.
Case 2: A departmental store wanted to find out if customers would be willing to pay slightly
higher prices for their products in order to have a smartphone app which customers can use to help
them locate items in the store. The store hired an interviewer John and placed him at the only
entrance on a particular day. John was asked to collect a sample of 100 opinions by interviewing
the next person who came through the entrance each time he finishes an interview.

(A) Case 1: Cluster sampling plan; Case 2: Systematic sampling plan.


(B) Case 1: Stratified sampling plan; Case 2: Systematic sampling plan.
(C) Case 1: Stratified sampling plan; Case 2: Non-probability sampling.
(D) Case 1: Cluster sampling plan; Case 2: Non-probability sampling.

16. A military officer was interested in reducing the number of casualties sustained in aerial battle.
His population of interest was all planes under his charge. He tasked his men to examine the
planes that returned from the war front, and then take note of which parts of the planes sustained
ammunition damage. He collated all the data and presented it on a single blueprint of the plane,
as shown below (the dots denote where ammunition damage occurred):

The officer then concludes: “Based on my sample data, I propose to fortify the plane armour for
regions where ammunition damage was concentrated (using the above blueprint as a guide), so as
to help these planes survive better.” Would you agree with his assessment and why?

(A) Yes. The sample collected came from a good sampling frame.
(B) No. The sample collected came from an imperfect sampling frame.
(C) Yes. The sample size is big enough.
(D) No. The sample size is too small.

17. If a sampling frame is the target population, it will not lead to a loss in the general-
isability of the results from the sample to the population.
Which of the following can be used to fill the blank appropriately? Select all that apply.

(A) equal to
(B) smaller than
(C) larger than

18. Tom selected 4 samples of 20 integers from the population {1, 2, . . . , 100} using 4 different methods.
They are

1. simple random sampling (SRS).



2. stratified sampling: the population was divided into the 10 strata {1, 2, . . . , 10}, {11, 12, . . . , 20},
. . ., {91, 92, . . . , 100}; and a SRS of 2 numbers was drawn from each of the 10 strata.
3. cluster sampling: the population was divided into 20 clusters {1, 2, 3, 4, 5}, {6, 7, 8, 9, 10}, . . .,
{96, 97, 98, 99, 100}; and a SRS of 4 of these clusters was selected.
4. systematic sampling: a random starting point between 1 to 5 was selected; and every 5th unit
thereafter was selected too.

He created dot plots for exactly 3 of the samples generated. Identify the sampling method depicted
by each of the following plots.

Sampling method depicted by Figure 1:


Sampling method depicted by Figure 2:
Sampling method depicted by Figure 3:

19. A study was conducted to find out the marital status of students who had graduated from univer-
sity. The population of interest was: University XYZ students who had graduated in 2019. The
study questionnaire was sent to all University XYZ 2019 graduates. 20% of the said graduates
responded to the survey. Among those who responded, 30% are married, 65% are single and 5%
are divorced/widowed. Are the above results likely to be a good reflection of the actual marital
status of all XYZ graduates in 2019?

(A) Yes, because the questionnaire was sent to the entire population of interest.
(B) Yes, because there is good representation of every marital status in the results.
(C) No, because the sampling method is non-probability.
(D) No, because only 20% of the said graduates responded to the survey.

20. A researcher wishes to estimate the average IQ of Primary 4 students studying in government
schools in Singapore. He carries out the following procedures to arrive at his estimate.

ˆ He collates a list of all government Primary Schools in Singapore located within a 5km radius
of where he stays, since it makes traveling to the schools easier for him.
ˆ From the collated list, he contacts the principal of each school and asks for permission to
conduct the IQ test on 50 Primary 4 students selected from that school via simple random
sampling. All the contacted principals were able to obtain consent from all the parents of the
selected students to conduct the IQ test.

ˆ He conducts the IQ test for all the selected students and then proceeds to calculate the average
IQ.
You may assume that the following statements are true:
ˆ He has not made any mistakes in the marking of the IQ test.
ˆ He has not made any mistakes in the calculation of the average IQ.
ˆ The selected students attempted the test to the best of their ability.
Based on the above description, which of the following statements must be true?
(A) The researcher has employed cluster sampling in the selection of students.
(B) There is no selection bias present in the study.
(C) The calculation of the average IQ is likely to be a good estimate of the average IQ of Primary
4 students studying in government schools.
(D) There is no non-response bias in his study.
21. Select the correct word from the list for the respective blank in the sentence.
“The (1) is used to quantify the degree of spread relative to the (2) and is a useful
statistic for comparing the degree of variation across different variables within a data set.”
List: Coefficient of variation, interquartile range, standard deviation, mean,
median.
22. Let x1 , x2 , . . . , xn be values of a numerical variable x within a data set containing n points.
Which of the following statements are definitely true with regards to the standard deviation?
Select all that apply.

(A) If the standard deviation of x is 0, then xi = x̄ (the mean of x) for all i ranging from 1 through n.
(B) If the standard deviation of x is 0, then xi = 0 for all i ranging from 1 through n.
(C) If xi = c, for all i ranging from 1 through n, where c is a constant, then the standard deviation
of x is 0.
(D) If the mean of x is 0 in the data set, then the standard deviation of x is also 0 in the data set.

23. A telecommunication company is interested in understanding how many mobile phones people
own. Their population of interest is all 2000 people in town X. They took a random sample of
100 people from town X. Assuming there is 100% response rate, which of the following statements
is/are correct?

(I) If among the 100 people sampled, every person has 2 or more mobile phones, then the mean
number of mobile phones in the sample will be greater than or equal to 2.
(II) If the mean number of mobile phones in this sample is greater than or equal to 2, then everyone
among the 100 people sampled has 2 or more mobile phones.

(A) Only (I).


(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

24. An examination was given to Class A and Class B, which consisted of 20 students each. The score
of each student is between 0 and 100.
The range of scores in Class A is from 70 to 90. All the students in Class B scored less than 40
marks. Due to manpower shortages, Class A and Class B were combined to form Class C. Hence
Class C now contains 40 students, who were previously from Class A and Class B.
Which of the following statements about the relationship between the mean score in Class C and
the mean score in Class A is always true?

(A) The mean score in Class C must be lower than the mean score in Class A.
(B) The mean score in Class C must be the same as the mean score in Class A.
(C) The mean score in Class C must be higher than the mean score in Class A.
(D) There is insufficient information to deduce the relationship between the mean score of Class
C and the mean score of Class A.

25. Consider the following numerical values:

14, 15, 18, 20, 24, 29, 33, 34, x,

where x is unknown and x may not necessarily be greater than or equal to 34. Which of the
following statements is/are necessarily true? Select all that apply.

(A) Regardless of the value of x, the median can never be higher than 24.
(B) If the median of the values is less than 24, the mode cannot be 24.
(C) The range cannot be 24.

26. Which of the following statements is/are always true, for any given data set?

(I) The first quartile, Q1 , is less than the third quartile, Q3 .


(II) The standard deviation is greater than 0.

(A) Only (I).


(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

27. The 2018 study “Tea Consumption and Longitudinal Change in High-Density Lipoprotein Choles-
terol Concentration in Chinese Adults” published in the Journal of the American Heart Association
found that tea appears to slow the natural decrease in heart-helping HDL cholesterol as a person
ages. A total of 80182 participants were asked to report on their baseline characteristics including
the following: Sex (Male / Female), Physical activity level (Inactive / Moderately active / Active),
Income level (“<600” / “600 to 1000” / “>1000” per month), and Body Mass Index (kg/m2 ).
Which of the following statements is true about the type of variables of Sex, Physical activity level,
Income level and Body Mass Index, respectively?

(A) Categorical ordinal, Categorical nominal, Categorical ordinal, Categorical ordinal.


(B) Categorical ordinal, Categorical ordinal, Categorical nominal, Numerical.
(C) Categorical nominal, Categorical ordinal, Categorical ordinal, Numerical.
(D) Categorical nominal, Categorical nominal, Categorical ordinal, Numerical.

28. An instructor of a new industrial attachment module decided to conduct a quick post-course survey
to gauge its reception among students. The information collected included the students’ sex (“1”
for male, “2” for female), days attended (how many days each student attended the course),
satisfaction level (“1” for not satisfied, “2” for neutral, “3” for very satisfied), and each student’s
score on the final exam of the pre-requisite module (out of 100). The results are shown in the
following table.

Name Sex Days attended Satisfaction Exam marks


Peter 1 2 1 73
Paul 1 2 1 77
Mary 2 3 2 84
Josephine 2 3 2 89
Frank 1 4 3 93

Which of the following is true about the type of variables of Sex, Days attended, Satisfaction, and
Exam marks, respectively?

(A) Categorical ordinal, Numerical, Categorical ordinal, Numerical.


(B) Categorical ordinal, Categorical ordinal, Categorical nominal, Numerical.
(C) Categorical nominal, Numerical, Categorical ordinal, Numerical.
(D) Categorical nominal, Categorical nominal, Categorical ordinal, Numerical.

29. Consider the following 9 whole numbers:

14, 15, 15, 18, 20, 21, 24, 25, 26.

There is a 10th unknown whole number, x, which belongs to the list of numbers above. We know
that the interquartile range of these ten numbers is 9. Which of the following is/are possible values
of x? Select all that apply.

(A) 10
(B) 15
(C) 20
(D) 25

30. A teacher wanted to know if there was any trend in terms of how his class scored for two different
in-class quizzes within a particular semester. For all 10 students in his class, he obtained their
scores for the two different quizzes, which we will refer to henceforth as Quiz A and Quiz B. The
results from both quizzes are represented in the following dot plots:

From the information given, which of the following statements is/are true?

(I) The coefficient of variation of Quiz A scores is greater than the coefficient of variation of Quiz
B scores.
(II) For each student let x be the Quiz A score and let y denote the Quiz B score. The correlation
coefficient between the two scores can be negative.

(A) (I) only.


(B) (II) only.
(C) Both (I) and (II).
(D) Neither (I) nor (II).
Chapter 2

Categorical Data Analysis

Section 2.1 Rates

In Section 1.3, we learnt that there are two main types of variables, namely categorical variables and
numerical variables. For categorical variables, there are two sub-types, namely ordinal variables and
nominal variables. Ordinal variables are those whose categories come with some natural ordering. On
the other hand, there is no intrinsic ordering for the nominal variables. For numerical variables, there
are those that are continuous and those that are discrete. The focus of this chapter is on categorical
variables and we will discuss numerical variables in the next chapter.
Much of the discussion in this section is centred around the following example.

Example 2.1.1 Suppose a patient newly diagnosed with kidney stones visits his urologist for the first
time since diagnosis to discuss what are some of the best possible treatments that he should undergo.
In preparation, the urologist took out some historical records of the various patients he had previously
and summarised the data into a table. Part of the table is shown below.

Size of stone Gender of patient Treatment type (X or Y) Outcome


Large Male X Success
Large Male X Success
Small Male Y Success
Large Male Y Failure
Small Male X Success
Large Male Y Success

Each row of the table corresponds to a particular patient that the urologist had seen previously, and the
columns are the variables recorded for each patient. While the table only shows the first 6 cases, the data
set actually contains 1050 observations (or data points). The four variables are

1. The size of the kidney stone. This is an ordinal categorical variable that has two categories. The
kidney stones can be classified as either small or large.

2. The gender of the patient. This is a nominal categorical variable that has two categories, male or
female.

3. The treatment that the patient underwent. Again, this is a nominal categorical variable and there
are two categories, namely treatment X and treatment Y.

4. The outcome of the treatment is also a nominal categorical variable. The categories are success
and failure.

How should the urologist use the 1050 observations to assist in the decision for this new patient?
Before we continue, let us recall the PPDAC cycle that was introduced as the main process behind
the approach to a data driven problem.

The overarching question faced by the urologist is simply how to treat his patients better, in particular
this new patient. What kind of insights does the data set give the urologist that will enable him to
better advise his patient?
To apply the PPDAC cycle in this context, let us start with a question that we want to answer. A
simple question to begin with is:

ˆ Question 1: Are the treatments given to the patients successful? In other words, should this new
patient receive treatment?

Moving on from “Problem” to “Plan”, we next determine which variables need to be
measured and then proceed to obtain data on 1050 previous cases where the outcome of the treatment
was recorded as either a success or a failure. The PPDAC cycle is a continuous process: after looking
at the data, drawing some preliminary conclusions might lead to more questions, some of which may not
even have been considered at the start. This stage of analysis involves sorting the data, tabulating it and plotting
graphs of the outcome variable. We may observe interesting trends, and this leads us to ask more
questions about those trends, bringing us back to the top of the cycle again. Some of the new questions that
we can ask include

ˆ Do males undergoing treatment X have a higher rate 1 of success than females?


ˆ Does treating large kidney stones with treatment X have a higher rate of success than treatment
Y?

We will now discuss some of the tables and charts that can be generated from the data that will give
us useful information.

Example 2.1.2 (Analysing 1 categorical variable using a table.) Suppose out of the 1050 pre-
vious patients, there were 831 records of success and 219 records of failure after treatment was given.
Thus from this simple collation, a preliminary conclusion is that we should generally recommend the
new patient to go for treatment since there are more successful outcomes than failed outcomes. We
can present this information on the number of successes and failures in a table, together with two other
columns, namely rate and percentage.
Categories of the "Outcome" variable   Count   Rate                                Percentage
Success                                  831   rate(Success) = 831/1050 = 0.791    0.791 × 100% = 79.1%
Failure                                  219   rate(Failure) = 219/1050 = 0.209    0.209 × 100% = 20.9%
Total                                   1050   1050/1050 = 1                       1 × 100% = 100%
1 The concept of rates will soon be discussed in this section.

The rate of successful treatments is simply

Number of successful treatments / Total number of treatments = 831/1050 = 0.791.

We can also represent this as a percentage of total treatments that were successful, which is 79.1%.
Similarly, the rate of failed treatments is

Number of failed treatments / Total number of treatments = 219/1050 = 0.209.

When represented as a percentage, the percentage of failed treatments is 20.9%.


For much of this chapter, we will be using rates in our discussion of the behaviour of categorical
variables. Intuitively, we can also think of rate as a fraction, proportion or a percentage. This is useful
for understanding some of its properties. For example, we note that

0% ≤ rate(X) ≤ 100% (if we think of rate as a percentage); or

0 ≤ rate(X) ≤ 1 (if we think of rate as a fraction).

(Here, X is some variable of interest.)

Example 2.1.3 (Analysing 1 categorical variable using a plot.) As an alternative to a table, we
can also use readily available software to create a plot that presents the data.

Since we are interested in the variable Outcome, which is a categorical variable, we can illustrate
the counts in each of the categories “Success” and “Failure” in the form of a bar plot. The two bar plots
above are created using Microsoft Excel.
The bar plot on the left is known as a dodged bar plot. The x-axis indicates the variable Outcome
whereas the y-axis shows the number (that is, the count) of successes and failures in the variable Outcome.
Two bars, one for success counts and the other for failure counts, are placed next to each other. Such
an illustration is useful in comparing the relative numbers in the categories.
The bar plot on the right is known as a stacked bar plot. The x and y-axes are similar to the dodged
bar plot but instead of two bars, we now have only one bar where the counts of failure (219) is stacked
on top of the counts of success (831). Such an illustration is useful in comparing the occurrences of each
category as a percentage or fraction of the total number of responses. Instead of showing the absolute
numbers in each category, it is also possible to show the percentages directly in the plot itself, as seen in
the figure below. However, it should be noted that the y-axis is now giving the percentages rather than
the actual numbers.

Regardless of which bar plot is used, we can see that there are many more successes than failures. Based
on the information that we have at this stage, it is reasonable to recommend that our patient go for
some form of treatment.
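The plots above were produced with Microsoft Excel; an equivalent chart can be drawn in Python. The following is a minimal sketch (illustrative only) using matplotlib to plot the outcome counts as a bar plot.

import matplotlib.pyplot as plt

counts = {"Success": 831, "Failure": 219}   # outcome counts among the 1050 patients

plt.bar(list(counts.keys()), list(counts.values()))
plt.xlabel("Outcome")
plt.ylabel("Number of patients")
plt.title("Outcomes of 1050 previous treatments")
plt.show()

A stacked version (the second plot described above) can be obtained by drawing one bar for the success count and then drawing the failure count on top of it via the bottom argument of plt.bar.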

Remark 2.1.4 In this example on treatment of kidney stones, the success of any treatment is defined
as having the kidney stones removed or reduced significantly so that it does not pose any further threat
to the patient. On the other hand, failure means that the stones were not able to be removed. In general,
kidney stones cause little morbidity and mortality. It is useful to note that for other kinds of illness,
where treatments have higher stakes, the conclusion may be different.
Now that we are rather convinced that the new patient should receive treatment, the PPDAC cycle
brings us back with new questions that arise from our investigation into the data set of 1050 previous
patients. It is reasonable to ask the next question as follows:

ˆ Question 2: There are two types of treatment, namely X and Y. Which treatment type is better
for our new patient?

To answer this question, we can revisit the PPDAC cycle and define a new problem and plan to look
at new variable(s) of interest and analyse the data again using plots that we have introduced previously.

1. The new problem is as stated above, namely, which treatment is better for our new patient.

2. This means that the key variable that we should look at is the treatment type categorical variable,
which has two categories, treatment X and treatment Y.

3. This does not mean that treatment type is the only variable of interest, but rather, it should
be investigated together with the outcome variable. This is because we want to know how the
treatment type affects the outcome.

This leads us to our discussion of how to analyse two categorical variables.



Example 2.1.5 (Analysing 2 categorical variables using a table.) When we used a table to
analyse 1 categorical variable (Outcome), the table showed only the number of successes and failures
among the 1050 previously treated patients. When we introduce a second categorical variable (Treatment
type), we have a 2 × 2 contingency table that will summarise the two variables across the 4 (that’s why
it is called 2 × 2) possible combinations of (Treatment, Outcome).

                      Outcome
Treatment       Success   Failure   Row Total
X                 542       158        700
Y                 289        61        350
Column Total      831       219       1050
Recall that out of 1050 previous patients, 831 underwent successful treatments while the other 219
were failed treatments. The 2 × 2 table breaks down the 831 successful treatments according to the
treatment type. As seen from the Success column, 542 were given treatment X while 289 were given
treatment Y. Similarly, for the 219 failed treatments, 158 were given treatment X while 61 were given
treatment Y.
If we look across a row instead of down a column, we could, for example, see that there were 700
previous patients given treatment X, of which 542 were successful and 158 failed. Similarly, looking at
the row for treatment Y, we see that out of 350 people who underwent treatment Y, 289 of them had
successful treatments while 61 did not.
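If the raw patient records are available as a data table, such a contingency table can be produced directly with software. The following is a minimal Python sketch (illustrative only, using a few made-up rows in the same layout as the urologist's table) based on the pandas library.

import pandas as pd

# A few hypothetical patient records with the same columns as the urologist's table
records = pd.DataFrame({
    "Treatment": ["X", "X", "Y", "Y", "X", "Y"],
    "Outcome":   ["Success", "Success", "Success", "Failure", "Success", "Success"],
})

# Contingency table of Treatment against Outcome, with row and column totals
table = pd.crosstab(records["Treatment"], records["Outcome"], margins=True)
print(table)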

Remark 2.1.6
1. It should be noted that by convention, the dependent variable Outcome is placed on the columns
on the table while the independent variable Treatment type is placed on the rows.

2. The column total values for the success (831) and failure (219) columns should add up to the same
values as the sum of the row total values for Treatment X (700) and Treatment Y (350), which
obviously should both add up to the total number of data points in the data set which is 1050.

Discussion 2.1.7 In order to answer Question 2, it will be useful to ask other related questions, for
example:
1. Question 2a: What proportion of the total number of patients were given treatment Y (or X)?

2. Question 2b: Among those patients given treatment X, what proportion were successful?

3. Question 2c: What proportion of patients were given treatment Y and had a failed treatment
outcome?
To answer Question 2a, we note that there were 350 previous patients who underwent treatment
Y. The proportion of the total number of patients that underwent treatment Y is
350/1050 = 1/3 = 33 1/3 %.
We can also denote this as
rate(Y) = 1/3 or 33 1/3 %.
We have seen earlier that out of 1050 patients, there were 831 successful treatments, so we can write
rate(Success) = 0.791 or 79.1%. We know that
rate(X) = 700/1050 = 2/3 or 66 2/3 %.

                      Outcome
Treatment       Success   Failure   Row Total
X               542       158       700
Y               289       61        350
Column Total    831       219       1050

Notice that in the calculations above, we have each time used two numbers in the margin of the table
(for example, 831 and 1050) that relate to just one of the categorical variables (Outcome). We call
such rates marginal rates. Similarly, rate(Y) = 350/1050 = 1/3 is also a marginal rate.
How should we answer Question 2b? In this case, we need to zoom in onto the patients who had
undergone treatment X and figure out what proportion of them have had a successful treatment.
Referring to the table again, we see that out of 700 patients who were given treatment X, 542 of them
were successfully treated. Hence the proportion of successful treatments was
542/700 = 0.774 = 77.4%.
This rate of success is computed based on only those patients who were under treatment X, which sets
the condition for the calculation of the rate. Once such a condition is set, those patients on treatment
X will be considered as the population and those on treatment Y will not be part of any consideration.
Such a rate is known as a conditional rate, which is one that is based on a given condition.
A note on the notation used for conditional rates is that we replace the word “given” by a vertical
bar so that rate(Success given treatment X) is written as

rate(Success | X).

Let us consider Question 2c. From the table, we can see easily that there are 61 cases where
treatment Y was given but had an unsuccessful outcome.

                      Outcome
Treatment       Success   Failure   Row Total
X               542       158       700
Y               289       61        350
Column Total    831       219       1050

So the rate of patients who were given treatment Y AND had a failure was
61/1050 = 0.0581 = 5.81%.
This rate is known as a joint rate and it is not a conditional rate since we are looking at all the 1050
patients as our baseline. In other words, we are now considering patients on treatment X, as well as
patients on treatment Y as the population. One should be careful with the implicit difference in the
phrasing of the two statements:

ˆ What proportion of patients were given treatment Y and had an unsuccessful outcome?

Answer: rate(Unsuccessful and Y) = 61/1050 = 0.0581.

ˆ What proportion of patients given treatment Y had an unsuccessful outcome?

Answer: rate(Unsuccessful | Y) = 61/350 = 0.174.

The first question refers to the joint rate/proportion/percentage while the second question refers to the
conditional rate/proportion/percentage.
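As a quick illustration of the difference between the three kinds of rates, here is a short Python sketch that recomputes rate(Y), rate(Success | X) and rate(Unsuccessful and Y) from the counts in the contingency table; the variable names are purely illustrative.

```python
# Counts from the 2 x 2 contingency table above.
success_x, failure_x = 542, 158   # treatment X patients
success_y, failure_y = 289, 61    # treatment Y patients
total = success_x + failure_x + success_y + failure_y        # 1050 patients

# Marginal rate: uses only the margin for one variable (Treatment).
rate_y = (success_y + failure_y) / total                     # 350/1050 = 0.333

# Conditional rate: the baseline is restricted to treatment X patients.
rate_success_given_x = success_x / (success_x + failure_x)   # 542/700 = 0.774

# Joint rate: the baseline is all 1050 patients.
rate_failure_and_y = failure_y / total                       # 61/1050 = 0.058

print(rate_y, rate_success_given_x, rate_failure_and_y)
```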

Discussion 2.1.8 It should be clear at this point from our discussion of rates and proportions that
our decision on which treatment to suggest to our new patient cannot be based simply on the absolute
number of successes and failures for each treatment type. If we had based the decision on absolute
numbers, we would have gone for treatment X since there were 542 success cases compared to only 289
for treatment Y.

The reason why we should look at rates rather than absolute numbers is because the number of pa-
tients undergoing each treatment is different, so it would not be surprising if there were more successful
cases for treatment X because there were just more patients given this treatment, rather than because
it is more effective. Finding the rate of success for each treatment before comparing them is a form of
normalisation. At this stage of our analysis, when using the success rates to compare the treatment
types, our conclusion is to recommend treatment Y to our patient.

ˆ The rate of success given treatment X is the conditional rate we have already calculated in answering
Question 2b, which is
rate(Success | X) = 542/700 = 0.774 = 77.4%.
Similarly, we can calculate the rate of success given treatment Y, which is
rate(Success | Y) = 289/350 = 0.826 = 82.6%.

ˆ We can also look at the conditional rates in another way. For treatment X, a conditional rate of
success of 77.4% means that out of every 100 patients who underwent treatment X, about 77 of them
had a successful outcome. For treatment Y, the corresponding figure was about 83 successes out of
every 100 patients receiving this treatment.

ˆ As the rate of success for treatment Y is higher, we can now say that treatment Y is better than
treatment X and advise the patient appropriately. Notice that we would have given the opposite
advice if we were looking at absolute numbers instead of rates, which is incorrect.

ˆ We can now add in the rates to the 2 × 2 contingency table given at the beginning of Example
2.1.5.

                      Outcome
Treatment       Success        Failure        Row Total
X               542 (77.4%)    158 (22.6%)    700 (100%)
Y               289 (82.6%)    61 (17.4%)     350 (100%)
Column Total    831 (79.1%)    219 (20.9%)    1050 (100%)
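The row percentages shown in brackets above can be obtained by dividing each row of the contingency table by its row total. A possible sketch, assuming the counts are stored in a pandas data frame as in the earlier code block:

```python
import pandas as pd

counts = pd.DataFrame(
    {"Success": [542, 289], "Failure": [158, 61]},
    index=pd.Index(["X", "Y"], name="Treatment"),
)

# Divide each row by its row total to obtain the conditional rates (row percentages).
row_percentages = counts.div(counts.sum(axis=1), axis=0) * 100
print(row_percentages.round(1))
#            Success  Failure
# Treatment
# X             77.4     22.6
# Y             82.6     17.4
```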

Example 2.1.9 (Analysing 2 categorical variables using a plot.) In Example 2.1.3, we introduced
dodged bar plots and stacked bar plots to present the data on a single variable Outcome. We can also
use these plots to present the counts of Outcome broken down by Treatment. These were the two
variables we analysed using a table in Example 2.1.5.

The dodged bar plot on the left shows the success and failure counts for both treatments X and Y.
The number above each bar is the success or failure count for that particular treatment type.
The stacked bar plot on the right shows the same information but with the success and failure bars
under the same treatment stacked instead of being placed side by side.

Both bar plots tell us that there are a lot more successful treatments under treatment X than under
treatment Y, which may lead to the conclusion that treatment X is more effective (since the green bars
for treatment X are bigger than the green bars for treatment Y). However, it is also obvious from the
stacked bar plot that the two treatments have very different numbers of patients (represented by the
heights of the two bars).
Similar to our analysis using tables, we can also create the plots using rates instead of absolute
numbers.

In this plot, notice that both the treatment X and treatment Y bars have been normalised to the
same height (which is 100%). We are no longer comparing absolute numbers, but instead comparing
the rates of success (the height of the green bars, as a proportion of the total height) between the two
treatments. We can see immediately that treatment Y has a higher rate of success (taller green bar)
compared to treatment X.
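Such a 100%-stacked bar plot can be produced with standard plotting libraries. Below is a minimal sketch using pandas and matplotlib, assuming the same counts as before; the colours and labels are illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.DataFrame(
    {"Success": [542, 289], "Failure": [158, 61]},
    index=pd.Index(["X", "Y"], name="Treatment"),
)

# Normalise each treatment's counts to 100% so that bar heights are directly comparable.
percentages = counts.div(counts.sum(axis=1), axis=0) * 100
percentages.plot(kind="bar", stacked=True, color=["green", "red"])
plt.ylabel("Percentage of patients (%)")
plt.title("Outcome by treatment type, normalised to 100%")
plt.show()
```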
To summarise this section, we have discussed how we can analyse two categorical variables. This can
be done either using a 2 × 2 contingency table, or bar plots (dodged or stacked) which makes it easier
for us to observe any differences between the categories. We also introduced the concept of rates, as a
means of fair comparison when group sizes are unequal. To formally discuss the relationship between
two categorical variables, we will introduce the concept of association in the next section.

Section 2.2 Association

Definition 2.2.1 In Section 2.1, we considered the example of two different treatments for patients
with kidney stones. Let’s say that initially, we guessed that the treatment type involved does not affect
the outcome of the treatment, meaning that we could advise our patient to undergo either treatment
because the outcome would not be affected. If this was the case, then we can say that the treatment
type is not related to the outcome of the treatment.
After analysing the data using rates, we found that this was not the case. There was a higher success
rate observed for patients under treatment Y compared to those under treatment X. Due to the difference
in success rates, we say that there is a relationship between the type of treatment and the outcome of
the treatment.
To formalise the notion of such a relationship, we say that treatment type is associated with the
outcome of the treatment. More specifically, treatment Y is positively associated with the success of the
treatment. What this means is that treatment Y and successful treatments tend to occur together.
On the other hand, we say that treatment X is negatively associated with the success of the treatment.
This is because we tend to see treatment X and failed treatments go hand in hand.

Remark 2.2.2

1. Note that treatment X being negatively associated with the success of the treatment does not mean
that a significant proportion of patients undergoing treatment X will see the treatment fail (77.4%
of them still recorded success). The negative association is stated as a comparison between the
two treatment types X and Y, where in this case treatment Y tends to produce more successful
outcomes.

2. We should be conscious of the choice of the word associated because we do not know if the
outcome of the treatment was entirely due to the treatment type received. The data we had came
from an observational study hence it might be erroneous for us to say that the type of treatment
and the outcome of the treatment have a causal relationship. It is important to see the distinction
between association and causation and for the rest of this chapter, we will be focussing on discussing
associative relationships between categorical variables rather than causal relationships.

Discussion 2.2.3 So how do we identify an association between two variables? Suppose the two vari-
ables we are considering represent two characteristics in a population. Let us call these two characteristics
A and B. For example, A could be smoker (so one categorical variable could be smoking habit, with
two categories smoker and non-smoker ) while B could be male (so the other categorical variable could
be gender, with two categories male and female). The population can be a well-defined group of peo-
ple. In the population, those “with A” are the smokers, while those “without A”, denoted by NA, are
the non-smokers. Similarly, those “with B” are the males and those “without B”, denoted by NB, are
the females.
So if the rate of A given B (proportion of smokers among males) is the same as the rate of A given
NB (proportion of smokers among females), then it means that the rate of A is not affected by the
presence or absence of B. Thus in this case, there is no difference in the proportion of smokers between
both gender groups and we write
rate(A | B) = rate(A | NB).

However, if the rate of A given B is not the same as the rate of A given NB, then there are two
possible situations.

ˆ The first possibility is that the rate of A given B is more than the rate of A given NB. This means
that A is relatively more common when B is present than when B is absent. Hence we say
that there is positive association between A and B. In this case, we write

rate(A | B) > rate(A | NB)

and for the gender/smoking example, this means that there is a higher proportion of smokers among
males than the proportion of smokers among females. So being male and smoking are positively
associated.

ˆ The other possibility is that the rate of A given B is less than the rate of A given NB. This means
that A is relatively less common when B is present than when B is absent. Hence we say
that there is negative association between A and B. In this case, we write

rate(A | B) < rate(A | NB)

and for the gender/smoking example, this means that there is a lower proportion of smokers among
males than the proportion of smokers among females. So being male and smoking are negatively
associated.

The inequality rate(A | B) > rate(A | NB) (resp. rate(A | B) < rate(A | NB)) is not the only one
that allows us to conclude that there is positive (resp. negative) association between A and B. The table
below provides three other comparisons between rates that lead to the same conclusion of a positive
(resp. negative) association between A and B. These different comparisons are mathematically equivalent
to each other and their equivalence will be established using the symmetry rule in Discussion 2.3.1.

Establishing association
Positive association between A and B: Negative association between A and B:
(any of the following) (any of the following)
rate(A | B) > rate(A | NB) rate(A | B) < rate(A | NB)
rate(B | A) > rate(B | NA) rate(B | A) < rate(B | NA)
rate(NA | NB) > rate(NA | B) rate(NA | NB) < rate(NA | B)
rate(NB | NA) > rate(NB | A) rate(NB | NA) < rate(NB | A)
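The first comparison in the table above can be turned into a small helper function. The sketch below is one possible way to do this in Python; the function name and argument layout are our own assumptions for illustration.

```python
def association(a_and_b, a_and_nb, na_and_b, na_and_nb):
    """Classify the association between A and B from the four cell counts.

    a_and_b   -- number of subjects with both A and B
    a_and_nb  -- number with A but not B
    na_and_b  -- number with B but not A
    na_and_nb -- number with neither A nor B
    """
    rate_a_given_b = a_and_b / (a_and_b + na_and_b)
    rate_a_given_nb = a_and_nb / (a_and_nb + na_and_nb)
    if rate_a_given_b > rate_a_given_nb:
        return "positive association"
    if rate_a_given_b < rate_a_given_nb:
        return "negative association"
    return "no association"

# Kidney stones example: A = successful outcome, B = treatment X.
print(association(a_and_b=542, a_and_nb=289, na_and_b=158, na_and_nb=61))
# -> "negative association", since 0.774 < 0.826
```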

Example 2.2.4 Let us revisit the earlier example on two different treatments for kidney stones. Recall
that the two variables were treatment outcome and treatment type. For the treatment outcome variable,
let us split the patients into group A, which is the group of patients with successful outcomes and the
group NA will be those with unsuccessful outcomes. For the other variable, we will also split the patients
into group B for those given treatment X and group NB for those given treatment Y.
Let us revisit some conditional rates that were computed previously.
1. rate(A | B) = rate(Success | X) = 542/700 = 0.774.

2. rate(A | NB) = rate(Success | Y) = 289/350 = 0.826.

Since
rate(A | B) < rate(A | NB),
we can say that the presence of A when B is present is weaker than the presence of A when B is absent.
Thus there are fewer successful treatments when looking at treatment X compared to treatment Y and
hence success of the treatment is negatively associated with treatment X.
Conversely, since there are more successful treatments when looking at treatment Y compared to
treatment X, we can conclude that success of the treatment is positively associated with treatment Y.

Section 2.3 Two rules on rates

Discussion 2.3.1 In this section, we will discuss two important rules regarding rates. Suppose we have
a population with two population characteristics A and B. Among the population there are those who
possess characteristic A and those who do not.
For ease of notation, we will denote those who possess characteristic A simply as “A” and those who
do not as “NA”. Similarly for characteristic B, those in the population who possess this characteristic
will be denoted as “B” and those who do not as “NB”.
(Symmetry rule)
The first rule that we will be discussing is known as the symmetry rule. Although there are three
parts to this rule, once we can understand the first part, the second and third parts will follow naturally.

Symmetry Rule Part 1:

rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA).

The above rule states that the rate of A given B is more than the rate of A given NB (call this
statement 1) if and only if the rate of B given A is more than the rate of B given NA (call this
statement 2). The if and only if here, denoted by ⇔, means that statements 1 and 2 happen together,
meaning that if one of the statements is true, the other one will also be true. In other words, the two
statements are either both correct or both incorrect. Another way of understanding (statement 1) if and
only if (statement 2) is

ˆ If (statement 1) holds, then (statement 2) must hold; AND

ˆ If (statement 2) holds, then (statement 1) must hold.

Suppose we know that


rate(A | B) is more than rate(A | NB), (1)
then we can safely say that
rate(B | A) is more than rate(B | NA). (2)
Why is this so? Let us try to explain this logically.

1. If rate(A | B) is more than rate(A | NB), which is (1), then this means that there is a positive
association between A and B;

2. This means that we are more likely to see A when B is present, compared to when B is absent;

3. This in turn means that we are more likely to see B when A is present, compared to when A is
absent;

4. Hence rate(B | A) is more than rate(B | NA), which is (2).

This is the same as saying that A and B are positively associated. Conversely, suppose we know that

rate(B | A) is more than rate(B | NA), (2)

then we can safely say that


rate(A | B) is more than rate(A | NB). (1)
The logical explanation is similar in nature.

1. If rate(B | A) is more than rate(B | NA), which is (2), then this means that there is positive
association between B and A;

2. This means that we are more likely to see B when A is present, compared to when A is absent;

3. This in turn means that we are more likely to see A when B is present, compared to when B is
absent;

4. Hence rate(A | B) is more than rate(A | NB), which is (1).

We have now seen Part 1 of the Symmetry Rule. Parts 2 and 3, as shown below, can be similarly
explained.

Symmetry Rule Part 1:

rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA).

Symmetry Rule Part 2:

rate(A | B) < rate(A | NB) ⇔ rate(B | A) < rate(B | NA).

Symmetry Rule Part 3:

rate(A | B) = rate(A | NB) ⇔ rate(B | A) = rate(B | NA).

We will now present a mathematical derivation for Part 1 of the Symmetry Rule. You are encouraged
to go through the same process for Parts 2 and 3. Consider a 2 × 2 contingency table shown below
representing variables A and B. Let w, x, y, z denote the counts in each of the 4 cells.

                B       Not B    Row total
A               w       x        w + x
Not A           y       z        y + z
Column total    w + y   x + z    w + x + y + z
Symmetry Rule Part 1 states that

rate(A | B) > rate(A | NB) ⇔ rate(B | A) > rate(B | NA).

From the contingency table, the inequality on the left means


w/(w + y) > x/(x + z)
which implies w(x + z) > x(w + y),
or equivalently wx + wz > xw + xy,
and thus wz > xy.

On the other hand, the inequality on the right means


w/(w + x) > y/(y + z)
which implies w(y + z) > y(w + x),
or equivalently wy + wz > yw + yx,
and thus wz > xy.

Hence, both inequalities are in fact equivalent to wz > xy and thus they are equivalent.

Example 2.3.2 Let us revisit our kidney stones treatment example. The 2 × 2 contingency table below
gives us the number of patients in each treatment type as well as the number of success and failure
outcomes for each treatment type.
                      Outcome
Treatment       Success   Failure   Row Total
X               542       158       700
Y               289       61        350
Column Total    831       219       1050
We have earlier shown that A (representing treatment outcome) is associated with B (representing
treatment type) since

rate(A | B) = rate(Success | X)
= 0.774
< 0.826 = rate(Success | Y) = rate(A | NB).

By symmetry rule part 2, we should have rate(B | A) < rate(B | NA). Let us verify that this is indeed
the case.

rate(B | A) = rate(X | Success)
            = 542/831    (since there are 831 successful cases, of which 542 came from treatment X)
            = 0.652,

rate(B | NA) = rate(X | Failure)
             = 158/219    (since there are 219 failure cases, of which 158 came from treatment X)
             = 0.721.

Since 0.652 < 0.721, we have thus verified that rate(B | A) < rate(B | NA) as predicted by symmetry
rule part 2. This also confirms that there is negative association between success of treatment (A) and
treatment X (B).
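The same verification can also be carried out numerically, which illustrates the algebraic fact that both comparisons reduce to comparing wz with xy. A short sketch, with the cell counts labelled as in the derivation above:

```python
# Cell counts from the kidney stones table, with A = Success and B = Treatment X:
# w = A and B, x = A and not B, y = not A and B, z = not A and not B.
w, x, y, z = 542, 289, 158, 61

rate_A_given_B = w / (w + y)      # 542/700 = 0.774
rate_A_given_NB = x / (x + z)     # 289/350 = 0.826
rate_B_given_A = w / (w + x)      # 542/831 = 0.652
rate_B_given_NA = y / (y + z)     # 158/219 = 0.721

# Both comparisons reduce to comparing w*z with x*y, so they must agree.
print(rate_A_given_B > rate_A_given_NB)   # False
print(rate_B_given_A > rate_B_given_NA)   # False
print(w * z > x * y)                      # False, since 33062 < 45662
```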

Discussion 2.3.3 (Basic rule on rates.) The second rule on rates is known as the basic rule on
rates. The main rule, as well as three consequences of the main rule are shown below.

Basic rule on rates:

The overall rate(A) will always lie between rate(A | B) and rate(A | NB).

Consequence 1:

The closer rate(B) is to 100%, the closer rate(A) is to rate(A | B).

Consequence 2:

If rate(B) = 50%, then rate(A) = (1/2)[rate(A | B) + rate(A | NB)].

Consequence 3:

If rate(A | B) = rate(A | NB), then rate(A) = rate(A | B) = rate(A | NB).

1. The basic rule on rates states that the overall rate(A) is always between the conditional rates of A
given B and A given not B. This is because rate(A) is a weighted average of rate(A | B) and
rate(A | NB), with weights rate(B) and rate(NB) respectively, and a weighted average always lies
between the two values being averaged.

2. The first consequence gives us a little more indication of where the overall rate(A) is going to
be. If rate(B) is closer to 100% (than rate(NB)), then rate(A) is going to be closer to rate(A | B)
compared to rate(A | NB).

3. The second consequence specifically states that if rate(B) is exactly 50%, then rate(A) will be
exactly the mid point between rate(A | B) and rate(A | NB).

4. Finally, the third consequence states that if the two conditional rates, namely
rate(A | B) and rate(A | NB) are the same, then the overall rate(A) will also take the same value
of the two conditional rates.

At this point, the significance of the basic rule and its consequences may not be immediately apparent or
intuitive. The best way to understand them is through an example.

Example 2.3.4 Suppose a school has two different classes (call them class Bravo and class Charlie) of
students who took the same data analytics examination. We are interested in studying the passing rate
of students at the entire school level and also at each individual class level. Suppose we are given the
following information:

(1) The passing rate of students from Bravo is 75%.

(2) The passing rate of students from Charlie is 55%.

For convenience, let us denote class Bravo as “B” and class Charlie as “NB” (not B). Similarly, denote
passing the examination as “A” and not passing as “NA”. So the two pieces of information we have
above are:
rate(A | B) = 0.75 and rate(A | NB) = 0.55.

By the basic rule on rates, the overall passing rate of all students from the school, that is, rate(A) is
between the two conditional rates,

0.55 = rate(A | NB) ≤ rate(A) ≤ rate(A | B) = 0.75.



[Figure: a number line from rate(A | NB) = 0.55 to rate(A | B) = 0.75; without further information, rate(A) could lie anywhere in between.]

However, without any further information, we will not be able to determine the exact value of rate(A).
What about the three consequences of the basic rule?

1. The first consequence states that the closer rate(B) is to 100%, the closer rate(A) will be to
rate(A | B). In our example, it means that if the number of students in Bravo is far more than the
number of students in Charlie (thus rate(B) is closer to 100%), then the overall passing rate of the
school (that is, rate(A)) will be closer to the passing rate of class Bravo (that is, rate(A | B)).

[Figure: a number line from rate(A | NB) = 0.55 to rate(A | B) = 0.75, with rate(A) marked close to 0.75.]

2. The second consequence states that if rate(B) = 50% (which also means that rate(NB) = 50%),
then
rate(A) = (1/2)[rate(A | B) + rate(A | NB)].
That is, rate(A) will be right in between the two conditional rates. In our example, this means
that if the number of students in Bravo and Charlie are exactly the same, then the overall passing
rate of the school will be exactly in between 0.55 and 0.75, that is 0.65.

[Figure: a number line with rate(A) = 0.65 exactly midway between rate(A | NB) = 0.55 and rate(A | B) = 0.75.]

3. The third consequence states that if the two conditional rates rate(A | B) and
rate(A | NB) are the same, then the overall rate(A) will be the same value as the two conditional
rates. In our example, if the passing rates of class Bravo and class Charlie are the same, then the
overall passing rate of the school will be the same as the passing rate in either class.

Example 2.3.5 Let us continue with Example 2.3.4 and validate the basic rule on rates and the con-
sequences by considering actual numbers.

1. Suppose the total number of students and the number of passes in each of the two classes are given
in the table below.

          Total number of students   Number of passes   Passing rate
Bravo     600                        450                450/600 = 0.75
Charlie   80                         44                 44/80 = 0.55
School    680                        494                494/680 = 0.73

Notice that the passing rates of both classes are what they are supposed to be, but the number of
students in Bravo far exceeds the number in Charlie (so rate(B) = 600/680 ≈ 0.88, which is close to 100%).
While the overall school passing rate is between 0.55 and 0.75 (in accordance with the basic rule on
rates), it is much closer to the passing rate of Bravo, as predicted by consequence 1.

2. Suppose the total number of students and the number of passes in each of the two classes are as
given below instead:

          Total number of students   Number of passes   Passing rate
Bravo     600                        450                450/600 = 0.75
Charlie   600                        330                330/600 = 0.55
School    1200                       780                780/1200 = 0.65

Again, the passing rates of both classes are what they are supposed to be, but in this case, the
number of students in Bravo and Charlie are the same (so rate(B) = rate(NB) = 0.5). As predicted
by consequence 2, the overall school passing rate will be 0.65, which is right in between the two
class passing rates.

3. To illustrate consequence 3, suppose the passing rates for both classes are the same, as shown
below.

          Total number of students   Number of passes   Passing rate
Bravo     600                        450                450/600 = 0.75
Charlie   400                        300                300/400 = 0.75
School    1000                       750                750/1000 = 0.75

Now the two conditional rates, namely rate(A | B) and rate(A | NB) are equal. By consequence 3,
rate(A) will be the same value as the two conditional rates. This is indeed the case as we see that
the two classes have the same passing rate which will result in the school having the same passing
rate of 0.75. It is important to note that we do not require rate(B) to be the same as rate(NB)
for consequence 3 to hold. For our example, this means that we do not require classes Bravo and
Charlie to have the same number of students. As long as the two class passing rates are the same,
consequence 3 will hold.
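All three scenarios can also be checked against the weighted-average identity rate(A) = rate(B) × rate(A | B) + rate(NB) × rate(A | NB), which is what underlies the basic rule. A minimal sketch using the class sizes from the tables above:

```python
def overall_rate(rate_a_given_b, rate_a_given_nb, rate_b):
    """Overall rate(A) as a weighted average of the two conditional rates."""
    return rate_b * rate_a_given_b + (1 - rate_b) * rate_a_given_nb

# Scenario 1: 600 students in Bravo and 80 in Charlie, so rate(B) = 600/680.
print(overall_rate(0.75, 0.55, rate_b=600 / 680))   # about 0.726, close to 0.75

# Scenario 2: equal class sizes, so rate(B) = 0.5.
print(overall_rate(0.75, 0.55, rate_b=0.5))         # 0.65, exactly the midpoint

# Scenario 3: equal conditional rates (600 out of 1000 students in Bravo).
print(overall_rate(0.75, 0.75, rate_b=600 / 1000))  # 0.75, regardless of rate(B)
```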

Finally, let us verify the rule on rates using our kidney stones data set.

Example 2.3.6 We have seen the following table from Example 2.3.2.

                      Outcome
Treatment       Success   Failure   Row Total
X               542       158       700
Y               289       61        350
Column Total    831       219       1050

ˆ The conditional rates of success among the two treatment types are:
rate(Success | X) = 542/700 = 0.774,
rate(Success | Y) = 289/350 = 0.826.
The overall rate of success is
rate(Success) = 831/1050 = 0.791,
which is closer to the rate of success among patients with treatment X. This agrees with the basic
rule on rates and its first consequence, since there were more patients with treatment X (66.67%)
compared to treatment Y (33.33%).

In the three sections of this chapter we have discussed so far, we have seen how we can use the concept
of rates to investigate relationships, in particular, association between categorical variables. Very often,
exact rates (overall or conditional) are unknown to us but if we can apply some general rules like the
symmetry rule, basic rule or the consequences of the basic rule, we can still obtain valuable insights into
the data set we have on our hands. Making the best use of limited information is an important skill
when analysing data.
In the next section, we will discuss a surprising observation that can be counterintuitive to some but
is very important for anyone analysing data to be aware of.

Section 2.4 Simpson’s Paradox

Discussion 2.4.1 From earlier sections, when faced with the problem of advising our new kidney stones
patient, we have gone through two cycles of the PPDAC process.
The first question we asked was whether having any sort of treatment was better than not having
one. By comparing the rate of success (0.791) versus the rate of failure (0.209), we conclude that there
are many more successful than failed treatments from past records, so the decision was to advise our
new patient that he should be treated.
But this led us to the next question, as there are two treatment types available, which type of
treatment should we recommend? This made us go back to the data and compare the success rates of
those patients who were given treatment X as opposed to the success rates of those given treatment Y.
Upon delving deeper into the data, we discovered that treatment Y is positively associated with a
successful outcome. This suggests that treatment Y is “better” than treatment X and perhaps we should
advise our patient to undergo treatment Y.
Are we done with our analysis? Is there some lingering doubt in our minds that we may be providing
wrong advice to our patient? If we are convinced that treatment Y is better, should we always send
kidney stone patients for treatment Y? If not always, then when do we do so? What should our decision
be based on? These are again questions that prompt us to go back to our data and see if more information
can be obtained from it.

Size of stone Gender of patient Treatment type (X or Y) Outcome


Large Male X Success
Large Male X Success
Small Male Y Success
Large Male Y Failure
Small Male X Success
Large Male Y Success

The table above from Example 2.1.1 shows there are two other variables that we have not used in our
analysis thus far, namely the size of the kidney stone and also the gender of the patient. Would these
variables be an important factor in our consideration? How should we go about analysing them? What
type of visualisation would be useful when doing such analysis? Let us begin by exploring the stone size
variable.

Example 2.4.2 (Analysing 3 categorical variables using a plot.) In Example 2.1.9, we used a
stacked bar plot for “Outcome” by “Treatment” to compare the success rates for treatments X and Y.

From the stacked bar plot, we have concluded that treatment Y is positively associated with success.
We have not taken stone sizes into consideration thus far and the plot was made based on simply counting
the number of successes and failures across all stone sizes. In other words, this plot gave us the overall
success rates of treatments X and Y.

Let us now separate the data by considering the categorical variable of “stone size” which has two
categories, namely large stones and small stones.

Large stones Success Failure Total


X 381 145 526
Y 55 25 80
Total 436 170 606

The table above shows the outcome of treatments given to patients with large stones. For example,
out of 526 treatment X patients with large kidney stones, 381 had a successful outcome and 145 were
unsuccessful. Similarly, out of 80 treatment Y patients with large kidney stones, 55 were successful while
25 were not. We can present this information using a stacked bar plot like before, as shown below.

Note: in the bar plot for treatment Y, the two displayed percentages do not add up to 100%. This is due to rounding off
in Excel; the success percentage is in fact 68.75% and the failure percentage is 31.25%.

How do the two different treatments compare? Although the margin of difference is not very big,
there is no doubt that treatment X has a higher success rate of 72.4% compared to treatment Y, which
has a success rate of 68.8%. This means that, for treating large kidney stones,
381/526 = 0.724 = rate(Success | X) > rate(Success | Y) = 0.688 = 55/80,
and thus treatment X is positively associated with success for treating large stones. This observation is
surprising, since we have already concluded that for all stone sizes combined together,

rate(Success | X) < rate(Success | Y),

that is, treatment X is negatively associated with success if we do not segregate by stone size.
Why are we observing a different behaviour for large kidney stones as opposed to what we saw earlier
when all kidney stone sizes are combined?
Let us consider the data for small kidney stones.

Small stones Success Failure Total


X 161 13 174
Y 234 36 270
Total 395 49 444

The table above shows the outcome of treatments given to patients with small stones. For example,
out of 174 treatment X patients with small kidney stones, 161 had a successful outcome and 13 were
unsuccessful. Similarly, out of 270 treatment Y patients with small kidney stones, 234 were successful
while 36 were not. Let us again present these data using a stacked bar plot.

The margin of difference between the two treatment types is again not very big, but again we see
that treatment X has a higher success rate of 92.5% compared to treatment Y, which has a success rate
of 86.7%. This means that, for treating smaller kidney stones,
161/174 = 0.925 = rate(Success | X) > rate(Success | Y) = 234/270 = 0.867,
so again treatment X is positively associated with success for treating small stones, which is the opposite
from what we had when the data was combined and not segregated by stone size.
We can now combine the two previous plots by putting them side by side as shown below.

Notice that the first two bars from the left are for large kidney stones data while the last two bars
are for small kidney stones. This type of plot is sometimes referred to as a sliced stacked bar plot. Such
a plot can be used for comparing across three categorical variables. The three variables here are stone
size, treatment outcome and treatment type.
We are now facing a paradox. Although treatment Y appears to be the better treatment overall, when
the stone sizes are combined and not segregated, we see that if we focus only on the large stones, or only
on the small stones, treatment X is observed to have a higher success rate than treatment Y. This is
indeed strange!
This phenomenon is known as Simpson’s Paradox .

Simpson’s Paradox:

Simpson’s Paradox is a phenomenon in which a trend appears in more than half of
the groups of data but disappears or reverses when the groups are combined. Here,
“disappears” means the two variables in question (say A and B) are no longer associated,
that is, rate(A | B) = rate(A | NB).

We are now back to the same question which we thought we had already answered: Which treatment
is better for our patient? Should we advise him to undergo treatment X or Y?

Remark 2.4.3 In the example of kidney stones, there were only two subgroups for the stone size,
namely, small and large. We claim that Simpson’s Paradox was observed because the trend in both
subgroups is different from the trend observed when the subgroups are combined.
In examples where there are more than two subgroups, we will say that Simpson’s Paradox is observed
as long as a majority of the individual subgroup rates shows the opposite trend to the overall rate. For
example, if there are three subgroups, as long as there are at least 2 subgroups showing the opposite
trend to the overall rate, we can say that Simpson’s Paradox is observed.

Example 2.4.4 (Analysing 3 categorical variables using a table.) Let us put the two tables in
Example 2.4.2 for both the large and small kidney stones together into one unified table.

      Large stones                       Small stones                       Total (Large + Small)
      Succ. trt.   Total   Rate (%)      Succ. trt.   Total   Rate (%)      Succ. trt.   Total   Rate (%)
X     381          526     72.4%         161          174     92.5%         542          700     77.4%
Y     55           80      68.8%         234          270     86.7%         289          350     82.6%

To recap, we have two different treatment types, X and Y. In the row for treatment X, we see that there
were 526 large stones cases that were under treatment X, of which 381 were successful. This gives a
success rate of 72.4%. Similarly, in the row for treatment Y, we see that there were 270 small stones
cases that were under treatment Y, of which 234 were successful. This gives a success rate of 86.7%.
The last 3 columns of the table give the combined numbers for both stone sizes.
Recall we had initially concluded that treatment Y was the better treatment because 82.6% of
patients who were given treatment Y had a successful outcome, compared to 77.4% for treatment X. We
then separated the cases according to the size of the stone, i.e., we created subgroups and this method
of subgroup analysis is called slicing.
This is when we observed Simpson’s Paradox, where the rate of success amongst small (92.5%)
and large (72.4%) stones is higher for treatment X compared to treatment Y. This reverses the trend
observed when the small and large kidney stones were combined.
Let us look at the numbers in the treatment X row of the table. A crucial observation at this point is that
treatment X seems to have been used mostly to treat patients with large stones rather than small stones.
Thus, by the basic rule on rates, we know that the overall success rate of treatment X will be closer
to the large stones success rate of 72.4% than the small stones success rate of 92.5%. Indeed, we have
the overall treatment X success rate to be 77.4%.
Turning our attention to the numbers in the treatment Y row of the table, we observe the opposite of
the above. Treatment Y seems to have been used mostly to treat patients with small stones rather than large stones.
Again, by the basic rule on rates, we would expect the overall success rate of treatment Y to be closer
to the small stones success rate of 86.7% than the large stones success rate of 68.8%. Indeed, we have
the overall treatment Y success rate to be 82.6%.
Combining these two observations, it is no wonder that we have the overall success rate of X to be
lower than the overall success rate of Y.
Another very telling observation from the table is that the range of success rates for treating large
stones is between 68.8% (treatment Y) and 72.4% (treatment X). Compare this with the range of success
rates for treating small stones which is between 86.7% (treatment Y) and 92.5% (treatment X). This
tells us that treatments for large stones have a lower rate of success compared to small stones, which is
not unreasonable to believe.
In conclusion, we can explain Simpson’s Paradox in the following way. Treatment X is in fact a better
treatment than Y. However, because Treatment X has been used mostly to treat the more difficult cases
(large kidney stones), this lowers the overall success rate of treatment X. It does not change the fact that
in the individual subgroups, regardless of stone size, treatment X achieves a higher success rate than
treatment Y. Slicing the data into the small and large stone subgroups will reveal that treatment X is
indeed a better treatment.
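The entire slicing analysis can be reproduced compactly with software. The sketch below stores the subgroup counts from the unified table in a pandas data frame (the column and index names are our own) and recomputes both the subgroup and the combined success rates.

```python
import pandas as pd

# Successes and totals for each (stone size, treatment) subgroup, from the table above.
counts = pd.DataFrame(
    {"successes": [381, 55, 161, 234], "total": [526, 80, 174, 270]},
    index=pd.MultiIndex.from_tuples(
        [("Large", "X"), ("Large", "Y"), ("Small", "X"), ("Small", "Y")],
        names=["size", "treatment"],
    ),
)

# Success rate within each (size, treatment) subgroup: X beats Y in both sizes.
counts["rate"] = counts["successes"] / counts["total"]
print(counts)

# Success rate by treatment after combining the stone sizes: Y beats X overall.
combined = counts.groupby(level="treatment")[["successes", "total"]].sum()
combined["rate"] = combined["successes"] / combined["total"]
print(combined)   # Simpson's Paradox: the overall trend reverses the subgroup trend.
```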
Before we conclude this section, let us recap the story so far.

ˆ We started off with a new kidney stone patient coming to us for advice. Based on past patient
records, we were convinced that the success rate of undergoing treatment is higher than the failure
rate and thus conclude that the patient should undergo some form of treatment.

ˆ We were then faced with the decision between two treatment types: Treatment X and Treatment
Y. In determining which treatment type to recommend to the patient, we looked at the data on
hand of past patients and investigated if there was any association (positive or negative) between
Treatment X and Treatment Success.

[Diagram: Treatment X <-- Association? --> Success]

ˆ With further analysis, we found that Treatment X was negatively associated with Success. This
meant that we should recommend Treatment Y to our new patient. However, through another
iteration of the PPDAC cycle, we wondered how another variable like stone size may affect our
conclusion.

ˆ By slicing, we segregated our data into past patients with large stone size and others with small
stone size. Surprisingly for both subgroups, we found that Treatment X had a higher success rate
than Treatment Y. This reversed the trend that we saw when the subgroups were combined.

ˆ More importantly, we observed that Treatment X was used more often in dealing with large stones
compared to Treatment Y, which was more frequently used to deal with small stones. This means
that large stone size is likely to be associated with Treatment X.

[Diagram: Treatment X <-- Association? --> Success, with Large stone size connected to Treatment X by an association]

ˆ On the other hand, we also observed that patients with large stones have a lower success rate
(regardless of treatment type) compared to patients with small stones. This is perfectly reasonable
and thus also suggests that large stone size is likely to be associated with treatment success.

[Diagram: Treatment X <-- Association? --> Success, with Large stone size connected to both Treatment X and Success by associations]

ˆ This means that stone size is a (third) variable that was associated with the other two variables
whose relationship we were initially investigating, thus affecting the conclusion of our initial study.
Such a variable is called a confounder, and confounders will be the focus of our discussion in the next
section.

ˆ For now, we will note that when Simpson’s Paradox is observed, it implies that there is definitely a
confounding variable present, that is, a third variable that is associated with the two variables whose
relationship we are investigating. However, the existence of a confounder does not necessarily lead
to us observing Simpson’s Paradox.

Section 2.5 Confounders

Discussion 2.5.1 Continuing our kidney stones patients example, we were fortunate that the data
set contained information that may not have seemed important initially. Without performing further
investigation into the size of the kidney stones, we could have ended up giving the wrong recommendation
to our new patient.
In data collection, it is often important to collect more information on the subjects in addition to
those variables that are immediately apparent to be of importance. This is because we can never be
sure in advance whether some other variable may turn out to be a confounder that influences our study
of the association between the two variables of interest.
our subjects (for example, in a survey) as many questions and collect as much data as we want, but
practically, we also know that respondents do not like to see a long list of seemingly unrelated questions
in surveys. There are also cost considerations if we collect more data than necessary. To design a good
study, we need to strike a balance between the two.

Definition 2.5.2 A confounder is a third variable that is associated with both the independent and
dependent variables whose relationship we are investigating. Note that we do not specify the direction
(positive or negative) of association here. As long as the variable is associated in some way to the main
variables, we will call it a confounder, or a confounding variable.

Example 2.5.3 At the end of the previous section, we explained how the variable kidney stone size
is a confounding variable because it is associated with both the (independent) variable Treatment type
and (dependent) variable Treatment outcome. Let us now work through the calculations to justify these
associations. First, let us show that stone size is associated with treatment type.

Treatment Large Small Total


X 526 174 700
Y 80 270 350
Total 606 444 1050

The table shows the number of large and small stones treated by treatments X and Y respectively.
Out of 700 cases treated by treatment X, 526 were large stones and 174 were small stones. Out of 350
cases treated by treatment Y, 80 were large stones and 270 were small stones. Since
rate(Large | X) = 526/700 = 0.751 and rate(Large | Y) = 80/350 = 0.229,
we see that
0.751 = rate(Large | X) > rate(Large | Y) = 0.229,
and so large stones are positively associated with treatment X. This means that there is a higher pro-
portion of large stones being treated by treatment X compared to treatment Y.
Now let us turn our attention to the association between stone size and treatment outcome.
Stone size Success Failure Total
Large 436 170 606
Small 395 49 444
Total 831 219 1050

This table shows the number of success and failure outcomes for patients with large and small stones.
Out of 606 large stones cases, 436 were successfully treated while 170 were not successful. Out of 444
small stones cases, 395 were successfully treated while 49 were not successful. Since
rate(Success | Large) = 436/606 = 0.719 and rate(Success | Small) = 395/444 = 0.890,
we see that
0.719 = rate(Success | Large) < rate(Success | Small) = 0.890,
and so large stones are negatively associated with success outcome. This means that there is a lower
proportion of successful outcomes for large stones cases compared to small stones cases.
As we have now shown that stone size is associated with both the treatment type and the treatment
outcome, we are convinced that stone size is a confounding variable that needs to be managed. The way
to do it, as shown previously is to use slicing, where we segregate the data by the confounding variable.
This is done by investigating the association between the dependent and independent variables for large
stone cases separately from the small stone cases.
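Both association checks in this example amount to comparing two conditional rates, so they are easy to script. A minimal sketch using the counts from the two tables above:

```python
# Is stone size associated with treatment type?
rate_large_given_x = 526 / 700    # about 0.751
rate_large_given_y = 80 / 350     # about 0.229
print(rate_large_given_x > rate_large_given_y)   # True: large stones positively associated with X

# Is stone size associated with treatment outcome?
rate_success_given_large = 436 / 606   # about 0.719
rate_success_given_small = 395 / 444   # about 0.890
print(rate_success_given_large < rate_success_given_small)   # True: negative association

# Stone size is associated with both variables, so it is a confounder.
```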

Discussion 2.5.4 We have now seen the benefits of having more information on the subjects because it
allows us to identify confounding variables which would not have been possible if, for example, information
of stone size was not available or collected. Thus, an important learning point when it comes to designing
a study is to measure and collect data on additional variables that we feel may be relevant in our study.
Whether these additional variables turn out to be confounders or not we would have to probe further,
but we will never know if we do not collect data on them in the first place.
That said, we have to come to terms with the fact that most of the time, collecting information on
variables is costly in practice. Even if we do manage to collect all the information we need, the analysis
can be complicated if the data needs to be sliced along many different variables.
For non-randomised designs like observational studies, it is usually the case that the two groups that
we are comparing are not “identical” except for the treatment. Despite our best efforts, we can never
be totally sure that every single confounder has been identified and controlled for. Thus, observational
studies offer only a limited conclusion in providing evidence of association and not causation.

(Randomisation as a preferred solution to confounding.) An alternative approach to address


potential confounders is to rely on a strategy that was discussed in Chapter 1: randomised assignment.
Let us discuss in detail how this is done, using our well-developed kidney stones treatment example.

Example 2.5.5 Fundamentally, confounding arises from association, which is a consequence of having
unequal proportions of a characteristic across the two groups that we are trying to compare. For the kidney
stones example, stone size was a confounder because patients with large stones were disproportionately
allocated to treatment X instead of treatment Y. Now, if the allocation of large (and small) stone size
cases to the two treatment types was done randomly, which tends to result in an equal proportion across
the two groups, there would no longer be any association between stone size and treatment type. In this
case, stone size would no longer be a confounder. Note that a confounding variable is associated to both
the independent and dependent variables, so removing one of the associations is enough to remove the
confounding variable.

[Diagram: Treatment X <-- Association? --> Success, with Large stone size now showing no association with Treatment X but still an association with Success]

How can we achieve randomised assignment of patients to the two treatment types? One simple way
is, for example, to toss a fair coin when deciding which treatment a patient will be given. Surely, such a
method of randomised assignment tends to give us approximately equal proportions of large (and small)
stone cases across the two treatment types. If we have sufficiently many patients to assign to either
treatment types, the two groups of patients assigned to treatment X and treatment Y will tend to be
similar in all characteristics, including stone size.
Surely this addresses the problem of confounders appropriately, right? Unfortunately, randomisation
is not always possible in every study. Imagine the scenario where the type of treatment given to each
patient is dependent on a coin toss! Would you agree to this if you were one of the patients? Certainly
not! Patients usually have the right to choose which treatment group they want to be in and this would
make the assignment process non-random. Such ethical issues could very well constrain and prevent us
from performing randomised assignment of our subjects. In such a situation, we have no choice but to
fall back on the method of slicing for suspected confounders.
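To see why random assignment tends to balance stone sizes across the two treatment groups, one can run a small simulation. The sketch below is purely illustrative: it keeps the 606 large and 444 small stone counts from the data set, but assigns treatments by a simulated coin toss.

```python
import random

random.seed(2024)   # fix the seed so the illustration is reproducible

# 606 large-stone and 444 small-stone patients, as in the data set.
patients = ["Large"] * 606 + ["Small"] * 444

# Assign each patient to treatment X or Y by a fair coin toss.
groups = {"X": [], "Y": []}
for stone_size in patients:
    groups[random.choice(["X", "Y"])].append(stone_size)

for treatment, group in groups.items():
    proportion_large = group.count("Large") / len(group)
    print(treatment, len(group), round(proportion_large, 3))
# Both groups end up with roughly 58% large stones (606/1050 is about 0.577),
# so stone size is no longer associated with the treatment type.
```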

With this, we conclude Chapter 2, where we discussed in detail how we can use rates to study the
association between two (or more) categorical variables. We learnt about Simpson’s Paradox, which
led us eventually to the issue of confounders and how they can be managed. In the next chapter, we
will turn our attention to the other variable type, namely numerical variables.

Exercise 2
1. On 19 June 2021, The Straits Times published the figure below, taken from a population census
of Singapore. Each household may only belong to a single category.

What can be said about the resident households earning more than $6,999 from work? From the
following statements, select all that apply.

(A) A majority of resident households are earning more than $6,999 from work in 2020.
(B) A larger proportion of resident households are earning more than $6,999 from work in 2020,
as compared to 2010.
(C) rate(Income > $6,999 | 2020) > rate(Income > $6,999 | 2010). Here “Income” represents
Household monthly income from work.
(D) rate(Income > $6,999 | 2020) < rate(Income > $6,999 | 2010). Here “Income” represents
Household monthly income from work.

2. A researcher collected data on his study subjects. Unfortunately, he spilled coffee on his table of
values, resulting in some missing information. Based on the remaining information in the table,
calculate rate(Male) in the study.

X Y Row total
Female 300 100
Male 50
Column total 500

(A) rate(Male) = 0.2.


(B) rate(Male) = 1.
(C) rate(Male) = 0.7.
(D) rate(Male) = 0.1.

3. For the year 2020, the marginal death rate of country A is greater than the death rate among
the females of country A, or in other words, rate(Death) > rate(Death | Female). Which of the
following statements must be true in country A for the year 2020?

(A) rate(Male) < rate(Male | Death).


(B) rate(Male) > rate(Male | Death).
(C) rate(Male) = rate(Male | Death).

4. Categorical variables A and B are associated with each other. This means that

rate(A | B) ̸= rate(A | not B).

Based on the given information, which of the following statements is/are always true?

(I) rate(B | A) ̸= rate(B | not A).


(II) rate(B | A) = rate(A | B).
(III) rate(A | B) ̸= rate(not A | B).

(A) Only (II).


(B) Only (II) and (III).
(C) Only (I).
(D) All the statements are true.

5. The following data comes from a survey on the effectiveness of coaching sessions on
job hunting for fresh graduates. In this survey, three questions were asked of the participants:

ˆ Q1: What is their monthly salary, or are they unemployed?


ˆ Q2: Did they receive coaching? (Answer YES or NO.)
ˆ Q3: Was their job a continuation of their internship? (Answer YES or NO.)

The table below summarises the answers received for the three questions.

                        Q2: NO               Q2: YES
Q1: Monthly Salary      Q3: YES   Q3: NO     Q3: YES   Q3: NO     Total
> $4000                 26        85         8         28         147
< $3000                 5         31         0         7          43
$3000 − $4000           41        181        13        48         283
Unemployed              604                  141                  745

Based on the data provided above, which of the following statements is/are correct?

(I) There is an association between receiving coaching and having a salary above $4000.
(II) There is an association between receiving coaching and landing a job in continuation of an
internship.

(A) Only (I).


(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

6. The contingency table below shows the classification of hair descriptions of students studying in
an international school in Singapore.

Hair type
Straight Curly
Hair colour Male Female Male Female Total
Red 7 9 8 5 29
Brown 35 20 12 16 83
Blonde 51 55 38 27 171
Black 22 25 19 24 90
Total 115 109 77 72 373

The marginal rate, rate(Curly), is %; while the joint rate, rate(non-Black and Female)
is %. Give each answer as a percentage correct to 2 decimal places.

7. A group of market researchers were commissioned to investigate the relationship between two food
delivery companies (Grabfood and Foodpanda) and their punctuality of deliveries (whether they
are on time or late). The following chart was used by the researchers to aid in the presentation of
their findings.

Which of the following statements is/are true based on the information given above? Select all that apply.

(A) Grabfood is positively associated with being on time for food deliveries.
(B) Grabfood is positively associated with being late for food deliveries.
(C) Foodpanda is positively associated with being on time for food deliveries.
(D) Foodpanda is positively associated with being late for food deliveries.

8. A newspaper article had a headline “30% of local university students admitted last year graduated
from a polytechnic”. Assume there are only 2 universities (Uni A and Uni B). In Uni A, 50% of
its local students admitted last year graduated from a polytechnic. In Uni B, the percentage of its
local students admitted last year who graduated from a polytechnic must be

(A) more than 50%.


(B) 40%.
(C) between 30% and 50%.
(D) less than 30%.

9. By “elderly”, we mean a person who is more than 65 years old. In Singapore, the percentage
of elderlies among women is higher than the percentage of elderlies among men. Which of the
following statements must be true?

(I) In Singapore, the percentage of women among elderlies is higher than the percentage of women
among the non-elderlies.
(II) In Singapore, the percentage of women is higher than the percentage of men among elderlies.

(A) Only (I).



(B) Only (II).


(C) Both (I) and (II).
(D) Neither (I) nor (II).

10. Su is investigating the association between blood pressure and workaholism in a certain population.
Someone who works more than 75 hours per week is considered a workaholic.
The income level and blood pressure (high or normal) for each subject and whether or not they
are classified as workaholic is recorded and summarised in the table below. Here “HBP” denotes
“high blood pressure” while “NBP” denotes “normal blood pressure”.

Income Group
Low Middle High
HBP NBP HBP NBP HBP NBP
Workaholic 25 75 23 87 26 134
Non-workaholic 25 80 18 72 9 51

Consider the “Low” income level group.


rate(HBP | Workaholic) is (1) % while rate(HBP) is (2) %.
Fill in the blanks in the statement above, giving your answers as percentages correct to the nearest
whole numbers.

11. The Lord of the Rings: The Fellowship of the Ring was released in December 2001. Suppose that

(I) Among the people in Singapore who were born before 2000, 10% watched the film.
(II) Among the people in Singapore who were born during or after 2000, 20% watched the film.

Choose the best option below. Among all the people in Singapore, the percentage who watched
the film .

(A) must be 15%.


(B) must be between 10% and 20%.
(C) can be less than 10%.
(D) can be more than 20%.

12. Darren is planning a surprise Bubble Tea Party for his class of 30 students during the last tutorial
of GEA1000. Each student chooses either milk tea or fruit tea (but not both). The following
information is what he has gathered about his tutorial class:

ˆ Of the 30 students, 40% are males.


ˆ 60% of the students who drink milk tea at this party are males.
ˆ 70% of the students who drink fruit tea at this party are females.

Which of the following statements can be concluded from the above information about bubble tea
consumption in Darren’s tutorial class?

(I) There is positive association between being male and drinking milk tea.
(II) The majority of the 30 students are fruit tea drinkers.

(A) Only (I).


(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

13. A researcher was studying the effects of Paxlovid for COVID-19 treatment on a group of patients.
The researcher took notes on whether the treatment was successful or not (Success/Failure), and
the gender of each patient (Male/Female).
Which of the following statements is/are correct?

(I) rate(Success | Male) refers to the proportion of successful treatments that are males.
(II) rate(Failure & Male) refers to the proportion of males that had a failed treatment.
(III) rate(Female) refers to the proportion of females among the patients.

(A) Only (III).


(B) Only (I) and (II).
(C) Only (II) and (III).
(D) All three statements are correct.

14. Suppose that 70% of the male graduates in 2018 from university GER got married by 2022. In
addition, 5000 of all 6250 graduates in 2018 from university GER got married by 2022. The
percentage of female graduates in 2018 from university GER who got married by 2022 is

(A) lower than 70%.


(B) between 70% and 80%.
(C) higher than 80%.
(D) impossible to determine from the information given.

15. An incomplete breakdown of students in University S by its four faculties and by sex is as follows:

Male Female Row Total


Engineering 700
Arts 250 400 650
Architecture 50 50 100
Science 550
Column Total 1200 800 2000

Based on the above, which of the following deductions must be true? Select all that apply.

(A) rate(Arts) is closer to rate(Arts | Male) than to rate (Arts | Female).


(B) There is a negative association between Arts and Male.
(C) rate(Science) is closer to rate(Science | Female) than to rate (Science | Male).
(D) There is no association between Architecture and Female.

16. A researcher conducted a study to investigate if the usage of ChatGPT is associated with the passing
of an exam. He found the rate(pass | usage of ChatGPT) to be equal to 0.4 and rate(fail | no usage of
ChatGPT) to be equal to 0.3. Which of the following statements is/are true? Select all that apply.

(A) There is insufficient information to make a conclusion about the association between the two
variables.
(B) The usage of ChatGPT is positively associated with failing the exam.
(C) Passing the exam is negatively associated with the usage of ChatGPT.
(D) Failing the exam is positively associated with no usage of ChatGPT.

17. Ron is studying the reading habits of a certain group of adults in relation to their sex and whether
they wear spectacles. Some of the data he collected are presented in the incomplete contingency
table below. In addition, you are given that rate(Wear spectacles) = 0.5.

Males Females
Read books Wear Do not wear Wear Do not wear Total
weekly spectacles spectacles spectacles spectacles
Yes 72 65
No 20 34
Total 239 480

Fill up the missing cells in the table where possible and determine which of the following statements
must be true. Select all that apply.

(A) rate(Yes and Wear spectacles) = 61/240.


(B) rate(Do not wear spectacles and Females) = 31/96.
(C) rate(No | Males) cannot be determined.
(D) rate(Females | Wear spectacles) = 42/120.

18. To tackle the rise in dengue infections among residents in Compass Condominium, the building
management sent out a survey form to all of its 100 households to find out water disposal habits
among its residents. Households were randomly assigned into only one of the two modes: hard-copy
surveys placed in the residents’ mailboxes, and email surveys sent to the residents’ registered email
addresses. Based on the above random assignment, 55 of the 100 households were given hard-copy
surveys in their mailboxes. Half of those who responded did so using email surveys, while 40%
of those who did not respond were sent email surveys. What can we conclude from the above?
Select all that apply.

(A) The response rate of the survey is 50%.


(B) There is a negative association between those who received hard-copy surveys and those who
responded.
(C) 50% of those who were asked to do email surveys responded.
(D) The rate of response among email surveys is greater than the overall rate of response.

19. Consider the following partial contingency table that gives the breakdown of students by gender
in department A and department B of a local university. We are told that there is no association
between gender and department. The total number of students in both departments is .

Male Female Row Total


A 30 90
B 140

20. The table below provides the number of all the teachers employed in the different institutions in
2021 in Singapore, categorised by age and sex.

Using only the information in the table, fill in the blanks below (giving your answers to 2 decimal
places). Among all the teachers aged 30-39, (1) % of them were male secondary school
teachers. In addition, (2) % of female primary school teachers were aged 45 years old
and above.

21. Suppose that in a population, it is known that within males and within females, smoking and binge
drinking are positively associated. However, Simpson’s Paradox is observed when the male and
female subgroups are combined. From the statements below, select all that are true.

(A) Overall rate(Binge drinker | Smoker) ≤ overall rate(Binge drinker | Non-smoker).


(B) Overall rate(Smoker | Binge drinker) > overall rate(Smoker | Non-binge drinker).
(C) Overall rate(Smoker | Binge drinker) ≤ overall rate(Smoker | Non-binge drinker).
(D) Overall rate(Non-binge drinker | Smoker) ≥ overall rate(Non-binge drinker | Non-smoker).

22. A researcher wants to find out if drinking tea helps to reduce memory loss. He interviewed 100
elderly citizens from an Elder Care Center and inquired if they were tea drinkers. 60 of them were
classified as tea drinkers, while the remaining 40 were not. He then asked them to play a specific
memory game to test their memory. The researcher also noted that a potential confounding variable
was “gender”. To control for this potential confounder (gender), the researcher could perform

(A) double blinding.


(B) random assignment.
(C) slicing of the data.

23. The table below shows male and female patients undergoing two treatment types, X or Y . The
outcome of the treatment is designated as either successful or unsuccessful. The success rates of
the respective treatments across genders are also calculated.

Male Female
Patients Succ. # Succ. Rate Patients Succ. # Succ. Rate
X ? ? 50% 40 32 80%
Y ? ? ? ? ? 60%
Total 100 50 50% ? ? ?

Unfortunately, some of the data is missing. We know that all missing values are non-zero. Which
of the following statements must be true?

(I) Simpson’s Paradox is observed when the subgroups of Treatment X and Treatment Y are
combined, when considering the relationship between gender and outcome.
(II) Treatment type is a confounder between the variables gender and outcome.

(A) Only (I).


(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

24. A tuition agency is interested to see whether teaching English via a new online platform is more
effective compared to their current teaching methods for Primary 3 students. Every child has a
choice whether he/she wants to enrol in the class that teaches using the online platform (denoted
as Class A) or in the class without the online platform (denoted as Class B) and parental consent
is also obtained. At the end, all students are given an English test and are awarded an “S” grade
if they pass the test. The table below provides some of the information.

Class A Class B
Number S-grade Rate (%) Number S-grade Rate (%)
Males 200 40 20 100 10 10
Females w x y z

You are given the following rates (for males) as shown in the table above:
rate(S-grade | Class A) = 40/200 = 20%.
rate(S-grade | Class B) = 10/100 = 10%.
By considering similar rates for females, which of the following are possible values of w, x, y, z such
that Simpson’s Paradox will be observed in the above table when combining males and females
together within Class A and Class B?

(A) w = 100, x = 20, y = 80, z = 8.


(B) w = 200, x = 140, y = 100, z = 20.
(C) w = 80, x = 30, y = 200, z = 50.
(D) w = 50, x = 40, y = 200, z = 150.

25. A study was conducted to understand the relationship between a patient’s age and having cardio-
vascular disease (CVD). The information on the variables “Age” (Young/Old) and “CVD” (Has
CVD/No CVD) was collected in the table below.

Young Old Total


Has CVD 100 50 150
No CVD 100 200 300
Total 200 250 450

Furthermore, it is known that a third variable, “Smoking”, is associated with “CVD”. Using only
the information given, which of the following statements must be true? Select all that apply.

(A) Young patients are positively associated with having CVD.


(B) “Smoking” is a confounder when examining the association between “Age” and “CVD”.
(C) “CVD” is a confounder when examining the association between “Age” and “Smoking”.

26. A study was conducted to determine if treatment types (A and B) were associated with how
successful they were in curing disease X. The age of the subjects was also recorded as it is a
possible confounder. Each subject, depending on his/her age was classified as either “Old” or
“Young”. The table below shows the result of the study.

Treatment A Treatment B
Age Number Success Number Success
Old 120 25 280 75
Young 190 45 250 65
Total 310 70 530 140

Which of the statements below is correct?

(A) Age is a confounder between treatment types and how successful they are. Simpson’s Paradox
is observed in this study.
(B) Age is a confounder between treatment types and how successful they are. Simpson’s Paradox
is not observed in this study.
(C) Age is not a confounder between treatment types and how successful they are.
(D) More information is needed before determining if age is a confounder between treatment types
and how successful they are.

27. Su is investigating the association between blood pressure and “workaholism” in a certain popula-
tion. Someone who works more than 75 hours per week is considered a workaholic.
The income level and blood pressure (high or normal) for each subject and whether or not they are
classified as “workaholic” are recorded and summarised in the table below. Here “HBP” denotes
“high blood pressure” while “NBP” denotes “normal blood pressure”.

Income Group
Low Middle High
HBP NBP HBP NBP HBP NBP
Workaholic 25 75 23 87 26 134
Non-workaholic 25 80 18 72 9 51

Which of the following statements is true?

(A) We have an instance of Simpson’s Paradox for this data set, when considering the association
between being a “workaholic” and having “high blood pressure”, first for individual income
levels (“Low”, “Middle”, “High”) and then overall.
(B) We do not have an instance of Simpson’s Paradox for this data set, when considering the
association between being a “workaholic” and having “high blood pressure”, first for individual
income levels (“Low”, “Middle”, “High”) and then overall.
(C) We are not able to determine if we have an instance of Simpson’s Paradox for this data set
(or not), when considering the association between being a “workaholic” and having “high
blood pressure”, first for individual income levels (“Low”, “Middle”, “High”) and then overall.
There is insufficient information given.

28. In NUS, the rate of coffee drinking among female students is 60% and the rate of coffee drinking
among male students is also 60%. It was found that the rate of coffee drinking among scholarship
students is 90%. Which of the following statements must be true?

(I) Coffee drinking is positively associated with scholarship students.


(II) When considering the association between coffee drinking/non-coffee drinking and scholarship
students/non-scholarship students, Simpson’s Paradox is not observed when the male and
female subgroups are combined.

(A) Only (I).


(B) Only (II).

(C) Both (I) and (II).


(D) Neither (I) nor (II).

29. In a study, there are 3 variables being recorded: Sex, Cancer Status, and Smoking Status. The rate
of cancer among females is 40%, while the rate of cancer among males is also 40%. Researchers
discovered that the rate of cancer among smokers is greater than the rate of cancer among non-
smokers. Finally, it was found that males were more likely to be smokers as compared to females.
Which of the following statements is/are true? Select all that apply.

(A) The overall rate of cancer is 50%.


(B) There is a positive association between smoking and cancer.
(C) Sex is a confounder when looking at the relationship between Smoking Status and Cancer
Status.
(D) Smoking Status is a confounder when looking at the relationship between Sex and Cancer
Status.

30. Billy collected data from his friends and found out that height is associated with self-esteem. In
particular, being tall is positively associated with having high self-esteem. Megan, one of his friends,
suggested that gender may be a confounder. Billy agreed with her and decided to investigate this.
After calculating rate(Male | Tall) and rate(Male | Short), he found that gender is associated with
height. He then shared with Megan that she was right, and that gender is indeed a confounder.
However, Megan informed Billy that he missed a step in his conclusion. Which one of the following
is a possible pair of rates Billy missed before concluding that gender is a confounder?

(A) rate(Male | High self-esteem) and rate(Female | Low self-esteem)


(B) rate(Male | High self-esteem) and rate(Female | High self-esteem)
(C) rate(High self-esteem | Male) and rate(High self-esteem | Female)
(D) rate(High self-esteem | Male) and rate(Low self-esteem | Female)
Chapter 3

Dealing with Numerical Data

Section 3.1 Univariate EDA

In Chapter 1, we introduced two main types of variables that we will be focussing on, namely categorical
variables and numerical variables. Categorical variables were discussed extensively in Chapter 2 and in
this chapter, we will turn our attention to numerical variables and how they can be analysed.
Consider the following table that shows a portion of a data set relating to COVID-19 cases in Singa-
pore.

An example of a numerical variable in this data set is Age. Can you identify another numerical
variable? Exploratory Data Analysis (or EDA) is the process of summarising and understanding the
data and extracting insights or main characteristics from it. This is a critical part of the “Analysis”
step of the PPDAC problem-solving cycle. In this chapter, we will discuss how numerical variables can
be summarised and understood. To begin, the focus of this section will be on data exploration techniques
for one variable, or univariate exploratory data analysis.

Example 3.1.1 In Chapter 2, the recurring data set that was used to drive the discussion on categorical
variables was the patients with kidney stones data set. In this chapter, we will be using a data set closer
to home.

The data set (Microsoft Excel file partially shown above) that we will be looking at in this chapter
corresponds to sales of Housing Development Board (HDB) resale flats within the period of January 2017
to June 2021. The entire data set contains 99,236 rows and 11 columns. Note that each transaction is
a row of the Excel file and each transaction contains information on variables (the columns) like month
(of sale), flat’s floor area (in square metres), resale price, etc.
The PPDAC cycle starts off with
1. Problem. So what is the problem that we are considering and attempting to answer? If you are
a potential buyer, perhaps a question that you may be interested in investigating could be

What factors may affect the pricing of resale flats sold in Singapore?

2. Plan. Here, we need to decide which variables are relevant as possible factors for answering the
question. Suppose these variables were determined to be the 11 columns of the
data set. Some of these variables are

– “Month” - this is the month/year of the resale transaction;


– “Town” - this is the town that the resale flat belongs to;
– “Floor area sqm” - this is the floor size of the resale flat;
– “Resale price” - this is how much the flat was sold for.

3. Data. In this stage, data is collected and prepared as shown in the table above.

4. Analysis. We are now at this stage where the data is going to be analysed in attempting to answer
the Problem.

Definition 3.1.2 A distribution is an arrangement of data points, broken down by their observed number
or frequency of occurrence.

Example 3.1.3 Let us look at our HDB resale flats data set. The first few rows of the data set for
transactions from January to June 2021, is reproduced in the table below.

Month Floor area sqm Age Resale price


1/1/2021 45 35 225000
1/1/2021 45 35 211000
1/1/2021 73 45 275888
1/1/2021 67 43 316800
1/1/2021 67 43 305000
1/1/2021 68 40 260000
1/1/2021 73 44 351000
1/1/2021 73 44 343000
1/1/2021 75 41 306000

We would like to investigate the distribution of the Age1 variable. To do this, we would need to
collate the number of flats with the same ages when the resale transaction was made and put them in a
frequency table. For example the first two rows of the data indicates that the first two HDB flats in the
data set had the same age of 35 years when they were sold, while the third flat was 45 years old and so
on. Suppose the frequency table collated for the entire data set is as follows:

Age Frequency
2 9
3 8
4 583
5 1105
6 884
7 295
8 255
⋮ ⋮

If we simply look at the frequency values in the table, it would be hard to observe any patterns or
gain insights into how the frequencies are distributed across the different age values. We will introduce
two different graphs to present the distribution in a better way.

Example 3.1.4 (Histograms for Univariate EDA) A histogram is a graphical representation that
organises data points into ranges or bins. It is particularly useful when we have large data sets. Let
us see what the histogram looks like when we create one based on the “Age” frequency table from
Example 3.1.3. To create a histogram, the variable values are “grouped” into equal-sized intervals called
bins. For our “Age” variable, we can use bins with a width of 2 years. The number of flats in each bin
is counted and tabulated.

Bins Frequency
0-2 9
2-4 591 (8 + 583)
4-6 1989 (1105 + 884)
6-8 550 (295 + 255)
8-10 336 (219 + 47)
⋮ ⋮

You may notice that for the 2-4 Bin, the frequency is obtained by adding the number of flats sold
at Age 3 and Age 4 and excludes those sold at Age 2. Thus, the left-end point of the interval 2-4 is
excluded. The same is observed for the rest of the bins. The histogram created using Radiant is shown
below:

1 The data set, which can be downloaded from https://data.gov.sg/dataset/resale-flat-prices, actually does not contain
the “Age” variable. The “Age” variable was created by subtracting the lease commence date from the year the flat was sold.

With the height of each bar representing the frequency for that bin range, the highest bar would
represent the most frequently occurring range of values.
From the histogram above, we see that the range 4-6 years has the highest frequency as it accounts
for 1989 out of the total 11,644 transactions, or about 17% of the flats sold.
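As a rough illustration, the binning step described above can be reproduced with a few lines of code. The sketch below assumes the transactions have been loaded into a pandas DataFrame with an “Age” column; the file name used is hypothetical.

```python
# A minimal sketch of the binning step behind the "Age" histogram.
# Assumes a pandas DataFrame `resale` with a numerical "Age" column;
# the file name "hdb_resale_jan_jun_2021.csv" is hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

resale = pd.read_csv("hdb_resale_jan_jun_2021.csv")

# Bins of width 2, with the left end point excluded (e.g. the 2-4 bin
# counts flats aged 3 and 4, but not those aged 2), matching the text.
bin_edges = list(range(0, int(resale["Age"].max()) + 2, 2))
binned = pd.cut(resale["Age"], bins=bin_edges, right=True)
print(binned.value_counts().sort_index())   # frequency per bin

# A histogram of the same variable (note: matplotlib treats bin edges as left-inclusive).
resale["Age"].plot(kind="hist", bins=bin_edges, edgecolor="black")
plt.xlabel("Age of flat (years)")
plt.ylabel("Frequency")
plt.show()
```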

Remark 3.1.5 You may wonder how we came to the decision to have bin widths of 2 years rather
than 3 years (or any bigger number). There is no correct answer for this. Normally, we would construct
several histograms with different bin widths before deciding which one is most appropriate.
Once we have obtained and visualised the distribution of a numerical variable, we would like to
describe the overall pattern of the distribution as well as whether there are any deviations from the
overall pattern. To describe the overall pattern of the distribution, we will focus on the

1. Shape;

2. Center; and

3. Spread of the distribution.

For deviations from the overall pattern, this usually refers to identifying outliers which will be discussed
later on in this Chapter. Let us start by looking at how we can describe the shape of a distribution.

Discussion 3.1.6 (Shape - peaks and skewness). There are two important descriptors when we
discuss the shape of a distribution, namely the peaks and the skewness. Let us look at another histogram
plot obtained from the HDB resale data set. Rather than the age of the flat at the point of resale, we
consider another numerical variable of interest, which is the “Resale Price”. The following histogram
was obtained when we set a bin size of 25,000.

There is a peak in the interval [455000, 480000]. The distribution is unimodal, which means that it
has one distinct peak. This tells us that the most frequent resale flat prices lie between $455,000 and
$480,000.
Distributions are not always unimodal. Looking at the histogram we plotted earlier for the Age of
the resale flats, we see that there is more than one distinct peak. In such a situation, we say that the
distribution is multimodal. If a distribution has exactly two distinct peaks, we say it is bimodal.

In the histogram above, we see the highest peak in the 4-6 years range and the second highest peak
occurring in the 34-36 years range. It should be noted that we say these are peaks because they occur
most frequently in their immediate neighbourhoods of age ranges.
For a unimodal distribution, we can use another descriptor to describe the shape of the distribution,
that is, whether the distribution is symmetrical or skewed.

In a symmetrical distribution (middle picture above), the left and right halves of the distribution are
approximate mirror images of each other, with the peak in the middle.
For the picture on the left, the distribution is left skewed, with the peak shifted to the right and a
relatively long “tail” on the left.
The picture on the right shows a distribution that is right skewed. Such a distribution has the peak
shifted to the left and a relatively long “tail” on the right. Referring back to the distribution of resale
prices of HDB flats, we see that the distribution is right skewed, meaning that there are some (but few)
flats sold at very high prices. These data points gave rise to the long tail to the right of the peak.

Example 3.1.7 (Symmetrical distribution - Bell curve) One of the most well-known symmetrical
distributions is the normal distribution or what is commonly known as the bell curve. A famous example
of the normal distribution is that of the IQ scores in a population, based on the Wechsler Intelligence
scale.

From the figure, we see that the peak happens at 100, which means that the average IQ of a person in
the population is 100. We also see that about 68% of the population has IQ scores in the range between
85 and 115.

Discussion 3.1.8 (Central tendency - mean, median and mode). Besides describing the shape of
the distribution, we can also describe the characteristics of a distribution more precisely using measures
of central tendency. The three most common measures of central tendency are mean, median and mode,
which were all introduced in Chapter 1.
The three possible shapes of a distribution have different relative positions of the mean, median and
mode.

1. For a symmetrical distribution, the mean, median and mode will be very close to each other near
the peak of the distribution.

2. For a left skewed distribution, we usually (but not always) have

mean < median < mode .

To see why this is the case, notice that the small number of extremely small values which contribute
to the long tail on the left will push down the mean/average, as compared to the median, which is
less affected by these extremely small values. The mode, found at the peak of the distribution, is
naturally the largest among the three measures of central tendency.

3. For a right skewed distribution, we have the opposite of the left skewed distribution, which is

mode < median < mean .

In this case, there are a small number of extremely big values which contribute to the long tail on
the right. These big values will push up the mean/average, as opposed to the median, which is less
affected by these extremely large values. The mode in such a distribution would be the smallest
among the three measures of central tendency.

Example 3.1.9 Referring again to the resale prices distribution, we have seen the shape of the distri-
bution and concluded that the distribution is right skewed.
The mean, median and mode of this distribution were found to be $496,870.40, $468,000 and $420,000
respectively. This indeed agrees with

mode < median < mean .

Discussion 3.1.10 (Spread - standard deviation and range). Besides the shape and center of the
distribution, we can also describe the spread of a distribution. This refers to how the data vary around
the central tendency.

Take a look at the two distributions above, both of which have the same central tendencies. In fact,
the mean, median and mode of both distributions are 10. However, the top distribution has a relatively
lower variability compared to the distribution below. This means that the data in the top distribution
are all relatively close to the center while the data in the bottom distribution are more spread out, or
have more variability. We can also say that the data in the bottom distribution are spread across a much
wider range.
The most commonly used measure of variability is standard deviation which was introduced in Section
1.5. For the two distributions shown here, the top distribution has standard deviation 1.69 while the
bottom distribution has standard deviation 4.30.
A simpler measure of variability is the range of the distribution. This is defined to be the difference
between the largest and the smallest data points in the distribution. The range is simple to compute
but sometimes it can be misleading. For example, if we look at the range of the HDB resale prices data,
we obtain

Range = Highest resale price − Lowest resale price = $1,250,000 − $180,000 = $1,070,000.

The range is very large and is due to the existence of a few extremely high resale prices. It is not really
the case that there is great variability in resale prices as we see that most of the resale prices are actually
much lower and the variability is not as big as the range indicates it to be.

Definition 3.1.11 An outlier is an observation that falls well above or below the overall bulk of the
data.

Consider the data set with 11 data points shown above. We can consider 75 and 85 as outliers since
they are way larger than the rest of the data points. At this point, we use our judgement to identify
values that appear to be exceptions to the general trend in the data. Later on, we will be introducing a
more precise method (boxplot) to identify outliers.
Identifying outliers can be useful when we wish to identify any strong skewness in a distribution.
Sometimes the outliers are caused by erroneous data collection or data entry but this may not always be
the case. It is also possible that outliers are legitimate data points that provide us interesting insights into
the behaviour of the data. A general rule when we investigate a data set is that outliers should not be
removed unnecessarily as they do tell us something about the behaviour of the variable and prompt us
to investigate further why such extreme values can happen.

Example 3.1.12 Consider the data set below:

4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 300.

It is not difficult to be convinced that 300 is an outlier in the data set. The table below shows the three
different central tendencies as well as the standard deviation for the entire set and also when the outlier
is removed from the data set.
Mean Median Mode Standard deviation
Without removing 300 30 5.5 5 85.03
With 300 removed 5.45 5 5 1.04

We see that between the three central tendencies, the mean seems to be the most affected by the
removal of the outlier, while both the median and the mode either remained the same or only changed
slightly. Without removing the outlier, the mean is pulled away in the direction of the skew (in this
example, the distribution is skewed to the right). In such cases, the mean may no longer be a good measure
of the central tendency of the distribution. We call the median and the mode robust statistics.
In addition, the standard deviation also increases greatly from 1.04 to 85.03 because of the outlier.
This is expected because the standard deviation measures the spread of the data points and with the
outlier being far away from the other data points, the variability of the distribution is understandably
high.
As mentioned above, we need to treat outliers with care. If they have minimal effect on the conclusions
and if we cannot figure out why they are there, such outliers may possibly be removed. However, if they
substantially affect the results, then we should not drop them without justification.
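The figures in the table above can be checked with a short computation. The sketch below uses pandas, and the comments show the values reported in the table.

```python
# A small sketch verifying the summary statistics in Example 3.1.12,
# with and without the outlier 300.
import pandas as pd

data = pd.Series([4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 300])
no_outlier = data[data != 300]

for label, s in [("Without removing 300", data), ("With 300 removed", no_outlier)]:
    print(label,
          "mean:", round(s.mean(), 2),        # 30 vs 5.45
          "median:", s.median(),              # 5.5 vs 5
          "mode:", s.mode().iloc[0],          # 5 in both cases
          "sd:", round(s.std(), 2))           # 85.03 vs 1.04
```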

Example 3.1.13 Suppose we are interested to find out if there are significant differences in the dis-
tribution of HDB resale prices for different time periods. For example, would the distributions differ
significantly if we compare the period July to December 2020 with January to June 2021? The two
distributions are shown below.

The distribution on the left corresponds to the period of resale from July to December 2020. The
distribution for January to June 2021 is shown on the right. We observe that both distributions have
a similar shape which is right skewed with a single peak. Taking it one step further, we compare the
central tendencies and variabilities of the data points in both periods. The values in the table can be
computed using the Microsoft Excel Data Analysis Toolpak.

Mean Median Mode Range Standard deviation


July to December 2020 $462,827 $435,000 $400,000 $1,098,000 $155,955
January to June 2021 $496,870 $468,000 $420,000 $1,070,000 $162,107

Observe that all measures of mean, median and mode are higher in the time period January to June
2021 compared to those in the time period July to December 2020. The range of the resale prices is lower
in January to June 2021 while the standard deviation is actually higher. In conclusion, we can say that
resale prices in January to June 2021 are higher, but more spread out (in terms of standard deviation)
compared to the resale prices in July to December 2020.
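A similar comparison table can be produced directly from the data. The sketch below assumes a pandas DataFrame with a “Resale price” column and a (hypothetical) “Period” column distinguishing the two time periods; the file name is also hypothetical.

```python
# A sketch of how the comparison table in Example 3.1.13 could be computed.
import pandas as pd

# Hypothetical file name; assumed columns: "Period" and "Resale price".
resale = pd.read_csv("hdb_resale_2020_2021.csv")

def summarise(prices: pd.Series) -> pd.Series:
    """The five summary statistics used in the comparison table."""
    return pd.Series({
        "Mean": prices.mean(),
        "Median": prices.median(),
        "Mode": prices.mode().iloc[0],
        "Range": prices.max() - prices.min(),
        "Standard deviation": prices.std(),
    })

print(resale.groupby("Period")["Resale price"].apply(summarise).unstack().round(0))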

Example 3.1.14 In Example 3.1.4, we described the setting of bin widths when creating a histogram.
Deciding the bin width to use can have a big impact on what the histogram looks like and thus affect our
observations and conclusions on the shape of the distribution.

The three histograms above are constructed using the same data set of 233 students’ final exam scores
with the only difference being the bin width settings. The histogram on the left has a bin width of 20,
while the one in the middle has bin width of 10. The last histogram has bin width set at 5. What
conclusions can be made on the distribution based on these histograms?
Based on the first histogram, we may make the conclusion that most students score between 60 and 80
marks, and the distribution is rather symmetric. However, with a slightly smaller bin width, the second
histogram reveals that most students actually scored between 70 and 80 marks. This does not contradict
the observation made earlier based on the first histogram but because of the smaller bin width, we are
able to narrow the range of marks that are scored by most students. With an even smaller bin width, the
third histogram suggests that most students scored between 65 and 75 marks. How do you reconcile
this conclusion with the one from the second histogram?
In general, we should bear in mind the following when determining bin widths for histograms.

1. Avoid histograms with bin widths that are too large. This will result in only a few bins and
information in the data will be lost when data points are grouped together into a small number of
groups/bins.

2. Avoid histograms with bin widths that are too small. If we do this, there may be bins that have
very few data points (or none), which do not give us a good sense of the distribution.

3. Our initial choice of bin width may not be the most appropriate. Different histograms with various
bin widths should be created before deciding which one is the most useful and informative.

Remark 3.1.15 We should not confuse histograms with bar graphs introduced in Chapter 2. A his-
togram shows the distribution of a numerical variable across a number line. So one of the axes (usually
the horizontal) will display the range of values taken on by the numerical variable. On the other hand,
the horizontal axis of a bar graph will show the different categories of a categorical variable.
In addition, the ordering of the bars in a histogram cannot be changed, as the bars progress in order
through the range of values taken on by the numerical variable. On the other hand,

the ordering of the bars in a bar graph can be switched around with little consequence. There are also
usually no gaps between the bars in a histogram.

Discussion 3.1.16 (Boxplots for Univariate EDA) Besides a histogram, another way to visualise
the distribution of a numerical variable is to use a boxplot. To construct a boxplot, we will use the
five-number summary, consisting of

1. Minimum;

2. Quartile 1 (Q1 );

3. Median (Q2 );

4. Quartile 3 (Q3 );

5. Maximum.

The median and quartiles have already been introduced in Definition 1.6.1 and Definition 1.6.5. Fur-
thermore, we have also introduced the Interquartile range

IQR = Q3 − Q1 .

While the median can be viewed as the center of a data set, the IQR is a way to quantify the spread of a
data set. We have defined an outlier in Definition 3.1.11 but did not provide an explicit way to classify
a data point as an outlier. For our purpose we will adopt the following consideration to classify a data
point as an outlier.

A data point is considered an outlier if it satisfies one of the following conditions:

ˆ The value of the data point is greater than Q3 + 1.5 × IQR;

ˆ The value of the data point is less than Q1 − 1.5 × IQR.

To construct a boxplot, we do the following:

1. Draw a box from Q1 to Q3 .

2. Draw a vertical line in the box where the median (Q2 ) is located.

3. Identify all the outliers by using the consideration above.

4. Extend a line from Q1 to the smallest value that is not an outlier and another line from Q3 to the
largest value that is not an outlier. These lines are called whiskers.

5. Mark each of the outliers with dots or asterisks.

Example 3.1.17 Consider the following data set, with the data points already sorted in increasing
order.
18, 44, 47, 55, 61, 62, 78, 79, 83, 145.
There are 10 data points. The median (Q2) is the average of the fifth and sixth data points, so
Q2 = (61 + 62)/2 = 61.5.
The first quartile is the median of the first five data points: 18, 44, 47, 55, 61, so Q1 = 47. The third
quartile is the median of the last five data points: 62, 78, 79, 83, 145, so Q3 = 79. Following Remark
1.6.9, it should be pointed out that you may encounter slightly different ways of finding quartiles for a
data set in other texts. For this course, we will adopt what is presented here.

The Interquartile Range is IQR = Q3 − Q1 = 79 − 47 = 32.
To determine if we have outliers, note that 1.5 × IQR = 48. Since there are no data points smaller than
Q1 − 48, there are no small-valued outliers. On the other end, since 145 > Q3 + 48 = 79 + 48 = 127, we
see that 145 is the only big-valued outlier.
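The five-number summary and the outlier check above can be verified with a short computation. The sketch below implements the quartile convention used in this example (the median of the lower and upper halves of the sorted data), which may differ from the defaults of some software packages.

```python
# A minimal sketch reproducing Example 3.1.17: the quartiles, the IQR and
# the 1.5 x IQR outlier rule, using the lower-half/upper-half convention.
import statistics

data = sorted([18, 44, 47, 55, 61, 62, 78, 79, 83, 145])
n = len(data)

lower_half = data[: n // 2]
upper_half = data[(n + 1) // 2 :]

q1 = statistics.median(lower_half)      # 47
q2 = statistics.median(data)            # 61.5
q3 = statistics.median(upper_half)      # 79
iqr = q3 - q1                           # 32

outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(q1, q2, q3, iqr, outliers)        # 47 61.5 79 32 [145]
```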
The boxplot constructed is shown below.

Example 3.1.18 Let us return to the HDB resale flats data set. The boxplot below is based on the
resale prices of flats sold in January to June 2021.

The boxplot confirms our earlier conclusion that there are outliers that correspond to very high resale
prices. Note that the cross in the box, just above the median line, represents the mean resale price. Recall
that we have discussed the shape, center and spread of the distribution using a histogram. What can we
say based on the boxplot?

1. (Shape) From the boxplot, we see that the variability in the upper half of the data, given by (Max
− Median) is significantly larger than the variability in the lower half of the data which is equal
to (Median − Min). This confirms our earlier observation that the distribution is skewed to the
right and there is a relatively long tail to the upper end of the distribution due to the existence of
outliers.

2. (Center) The center, described by the median is easily observed in the boxplot, unlike in a
histogram. We can also compare the relative positions of the median and the mean from the
boxplot.

3. (Spread) The IQR of 204,000 gives us an idea of the spread for the middle 50% of the data set.
On its own it may not be immediately informative but this would be a meaningful measure to
compare across different distributions (see next example).

Example 3.1.19 The three boxplots below show the distributions of resale flat prices in three different
time periods, namely January to June 2020 (call this period P1), July to December 2020 (call this period
P2) and January to June 2021 (call this period P3). What can we say about the three distributions after
comparing the three boxplots?

1. All three distributions are right skewed as the upper halves of the data have greater variability than
the lower halves, due to (large-valued) outliers. However, upon a closer look, it is also apparent
that the upper half variability in period P1 is greater than the upper half variability in P2 which
in turn is greater than the upper half variability in P3.
2. The middle 50% (that is, the IQR) box of resale prices is lowest in P1, followed by P2 and then
P3. Hence, the overall resale prices have increased over time. The spread (given by the height of
the boxes) appears to be similar between P1 and P2 while slightly higher in P3.
3. There appear to be more outliers in P1 and P2 compared to P3.
To conclude this section, we summarise the comparison between using histograms and boxplots to
represent a distribution.
1. A histogram typically gives a better sense of the shape of the distribution of a variable, compared to
a boxplot. When there are great differences among the frequencies of the data points, a histogram
will be able to illustrate this difference better than a boxplot.
2. If we wish to compare the distributions of different data sets, putting the different boxplots side
by side is more illustrative than using histograms.
3. To identify and indicate outliers, boxplots do a better job than histograms.
4. The number of data points we have in a data set is better shown in a histogram than in a boxplot.
In fact, two distributions with very different numbers of data points can have almost identical
boxplots. On the other hand, this difference is apparent by comparing the histograms.
The bottom line is that different graphics and summary statistics have their advantages and disad-
vantages and they are often used together to complement each other.

Section 3.2 Bivariate EDA


In this section, we will focus on how we can investigate a relationship between two variables in a popu-
lation.

Discussion 3.2.1 We start off with a relationship between two variables that is deterministic. This
means that the value of one variable can be determined exactly if we know the value of the other variable.
Perhaps the most common type of deterministic relationship is the one that involves the conversion of
units of measurement from one metric to another. For example:

1. The relationship between Fahrenheit (F ) and Degree Celsius (C) in the measurement of tempera-
ture. We know that F and C are related by

C = (F − 32) × 5/9.

This is a deterministic relationship between F and C. For example, if the temperature in the oven
now is 450 degrees Fahrenheit (so F = 450), then the temperature in the oven now, measured in
Degree Celsius is
C = (450 − 32) × 5/9 = 232.22.

2. Meters (M ) and Feet (F ) are both measurements of length (or height) and they are related (ap-
proximately) by
F = 3.2808 × M.

So, if Johnny’s height is 5.9 Feet (so F = 5.9), then his height in meters will be

M = F/3.2808 = 5.9/3.2808 ≈ 1.8 meters.
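Both conversions are deterministic: each input value determines exactly one output value. A minimal sketch of the two conversions described above:

```python
# A deterministic relationship: each input determines exactly one output.
def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32) * 5 / 9

def feet_to_metres(feet: float) -> float:
    return feet / 3.2808

print(round(fahrenheit_to_celsius(450), 2))   # 232.22
print(round(feet_to_metres(5.9), 2))          # 1.8
```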

Discussion 3.2.2 The main focus of this section is on a relationship between two variables that is not
deterministic in nature. We say such a relationship is statistical or non-deterministic. Recall that in a
deterministic relationship, given the value of one variable, we can find a unique value of another variable.
However, this is not possible for a statistical relationship, where given the value of one variable, we can
describe the average value of the other variable. Such relationships between variables, called associations
occur quite often in our daily life.

Example 3.2.3 In a Medical News Today article2 published in November 2020, it was reported that in
a study involving more than 150,000 participants, a clear link was observed between low physical fitness
and the risk of experiencing symptoms of depression, anxiety, or both.

This association between physical fitness and mental health may not be surprising but we wonder if
it could be due to other factors, like a confounder. More interestingly, does having better fitness make
a person mentally healthier or having better mental health make a person exercise more resulting in
better physical fitness? We will not only measure the association (if one exists) between variables but
also attempt to interpret any observed associations.
Bivariate data is data involving two variables. For example, in the HDB resale flat data set, we can
study the two variables Age and Resale Price.

2 https://www.medicalnewstoday.com/articles/large-study-finds-clear-association-between-fitness-and-mental-health

Month Floor area sqm Age Resale price


1/1/2021 45 35 225000
1/1/2021 45 35 211000
1/1/2021 73 45 275888
1/1/2021 67 43 316800
1/1/2021 67 43 305000
1/1/2021 68 40 260000
1/1/2021 73 44 351000
1/1/2021 73 44 343000
1/1/2021 75 41 306000

In Section 3.1, we saw two ways to display univariate data, using either a histogram or a boxplot. For
bivariate data, it is clear that using a table like the one above is not really useful if we wish to investigate
if the two variables are associated. Instead, we will use a scatter plot to give us an idea of the pattern
formed by the data between the two variables in question. After looking at the scatter plot, we use a
quantitative measure called the correlation coefficient to quantify the level of linear association (if any)
between the two variables. Finally, we will attempt to fit a line or a curve through the points in the
data set which will enable us to make predictions on the values of the variables. This process is known
as regression analysis. For now, we will focus on scatter plots and defer the discussion on correlation
coefficients and regression analysis to the next few sections.

Example 3.2.4 Returning to our HDB resale flats prices data set, we will focus on the bivariate data
with the variables Age and Resale price. Suppose we wish to know if the age of the flat affects the resale
price, with the ultimate intention to make a prediction, based on the past resale prices, of how much
a 38 year old resale flat is likely going to cost. In this case, we can treat age as the independent (or
explanatory) variable and resale price as the dependent (or response) variable.

Our scatter plot shown above has the age (independent) variable on the x-axis and the resale price
(dependent) variable on the y-axis. Each resale transaction would be represented by an ordered pair

(x, y)

where x is the age of the resale flat and y is the resale price of that flat. For example, the ordered pair
(35, 225000) corresponds to the first resale flat listed in the table above. With a point plotted for each
ordered pair, since there are 11,644 resale transactions in the data set, there will be 11,644 points on the
scatter plot. Observe that in the scatter plot, each value of x (age of flat) corresponds to many different
values of y (the resale price). This is to be expected because there are many different transactions
involving flats of the same age and all these transactions are made at different resale prices.
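As a rough sketch, a scatter plot like the one described above can be drawn as follows, assuming the transactions have been loaded into a pandas DataFrame with “Age” and “Resale price” columns; the file name used is hypothetical.

```python
# A minimal sketch of the scatter plot in Example 3.2.4.
import pandas as pd
import matplotlib.pyplot as plt

resale = pd.read_csv("hdb_resale_jan_jun_2021.csv")   # hypothetical file name

plt.scatter(resale["Age"], resale["Resale price"], s=5, alpha=0.3)
plt.xlabel("Age of flat (years)")            # independent / explanatory variable
plt.ylabel("Resale price (SGD)")             # dependent / response variable
plt.show()
```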
How do we describe the relationship between two numerical variables using a scatter plot?

Univariate data                                      Bivariate data
Overall pattern   Deviation from the pattern         Overall pattern   Deviation from the pattern
1) Shape          Outliers                           1) Direction      Outliers
2) Center                                            2) Form
3) Spread                                            3) Strength

We have seen that for univariate data, we discussed the shape (symmetrical or skewed), center (me-
dian, mean and mode) and spread (interquartile range, standard deviation and range) of the distribution.
For bivariate data, we will use descriptors like the direction, form and strength to describe the relationship
between the two variables. For both univariate and bivariate data, data points that deviate significantly
from the pattern of the main bulk of data points are called outliers.

Definition 3.2.5 The direction of the relationship can be either positive, negative or neither. We say
that there is a positive relationship between two variables when an increase in one of the variables is
associated with an increase in the other variable.
On the other hand, a negative relationship between two variables means that an increase in one
variable is associated with a decrease in the other.
Not all relationships can be classified as either positive or negative and there are those that do not
behave in one way or the other.

The form of the relationship describes the general shape of the scatter plot. In general, we can classify
the form of the relationship as either linear or non-linear. The form of the relationship is linear when
the data points appear to scatter about a straight line. Later in the chapter, we will use a mathematical
equation to describe the straight line when the form of the relationship between two variables is linear.
When the data points appear to scatter about a smooth curve, we say that the form of the relationship
is non-linear. It is beyond the scope of this course to summarise curve patterns in the data but it is
useful to note that quadratic and exponential equations are examples of non-linear forms of relationship.

The two scatter plots on the left show a linear form of the relationship between the two variables
while the two scatter plots on the right show non-linear forms.
The strength of the relationship indicates how closely the data follow the form of the relationship.

Both scatter plots above suggest that there is a positive, linear relationship between the two variables.
However, the scatter plot on the left shows the data points lying very close to the straight line.
This indicates that the strength of the relationship is strong. The scatter plot on the right shows the
data points scattered loosely around the straight line and thus the strength of the relationship is weaker
than that in the scatter plot on the left.

Example 3.2.6 Let us look at the scatter plot from the HDB resale flats data again. The scatter plot
below is similar to the one from Example 3.2.4 except for an additional trendline drawn in black.

The trendline suggests that as the age of the HDB flat increases, the resale price decreases linearly
on average, in the period of January to June 2021. Is this relationship strong or weak? In fact, one can
argue that without the trendline, one may not even observe that there is a linear relationship between
age and resale price.
At this point, we cannot really tell if there is indeed a linear relationship and if there is, whether the
relationship is strong or weak. Nevertheless, in the next section, we will discuss a more precise measure
of the strength of a relationship.
As mentioned earlier, outliers are data points that deviate significantly from the pattern of the
relationship. Consider the scatter plot shown below that plots the resale price against the floor area of
the HDB resale flats. Do you observe any outliers?

Recall that for univariate data, using a boxplot, we can determine if a data point is an outlier by
checking if its value is greater than Q3 + 1.5 × IQR or smaller than Q1 − 1.5 × IQR. What about for
bivariate data? We will discuss more about outliers in the next section.

Section 3.3 Correlation coefficient

In the previous section, using the HDB resale flats data set, we have observed that a flat’s resale price
is associated with the age of the flat. From the scatter plot, we concluded that the relationship between
the age of the flat and the resale price of the flat was negative. This means that flats whose ages were
higher tended to have a lower resale price. This is not surprising. However, can we say anything about
whether this relationship is strong or weak? If possible, can we measure the strength of this relationship
using a number?
More generally, given two numerical variables, is it possible for us to measure the relationship between
the two variables quantitatively?

Definition 3.3.1 The correlation coefficient between two numerical variables is a measure of the linear
association between them. The correlation coefficient, denoted by r, always ranges between −1 and 1. We
can use this number to summarise the direction and strength of linear association between two variables.
The sign of r tells us about the direction of the linear association. If r > 0, then the association
is positive, which means that when one of the variables increases, the other variable will tend to increase
as well. On the other hand, if r < 0, then the association is negative, which means that when one of
the variables increases, the other variable will tend to decrease. In the event that r = 1 (resp. r = −1),
we say that there is perfect positive association (resp. negative association). When r = 0, we say there
is no linear association. Thus, while the sign of r tells us the direction of the linear association, the
magnitude of r (that is, how close r is to 1 or −1) will tell us the strength of the linear association
between two numerical variables.

Example 3.3.2 The two scatter plots below are examples of positive linear association between two
variables.

The plot on the left plots the price index of HDB flats against the price index of condominiums. We
observe that there is positive linear association between the two indices, which means that as the price
of HDB flats increases, it is likely that the price of condominiums would increase as well. The value of r
in this case is 0.95 which indicates that the association is strong.
The plot on the right shows the midterm mark of students against the final mark. Again, we observe
that there is positive linear association between the two marks and in this case, r was found to be 0.75.
The next two scatter plots are examples of negative linear association between two variables.

The plot on the left shows the price of oil against the price of gold. In this case, we observe that the
trend is that when the price of gold increases, the price of oil tends to decrease. The value of r was found
to be −0.67 and this indicates that there is negative linear association between gold and oil prices.
The plot on the right shows the amount of financial aid received by students against the students’
family income. It is not surprising to find that as the family income increases, the amount of financial
aid received by students would tend to decrease. The value of r in this case is −0.49 and there is negative
linear association between the two variables.

The two scatter plots above are examples where r = 0. This means that there is no linear association
between the two variables. However, note that while r = 0 for the second plot, we can see that the data
points fit very well onto a curve and there is a clear non-linear relationship between X and Y . More
generally, no linear association between variables does not necessarily mean no association between
variables.

The two plots above show situations where there is perfect (positive or negative) linear correlation
between the two variables. In such cases, all the data points are connected by (and thus lie on) a straight
line. There is, however, one exception, which is when the straight line joining all the data points is
actually a straight horizontal (or vertical) line. In such instances, the value of r is 0 and there is no
association between the two variables. This is because when the data points are connected by a vertical
or horizontal line, a change of value in one of the variables does not relate to a change in the other
variable.
When describing the strength of a linear relationship, we usually follow the rule of thumb as given
in the diagram below.

When the magnitude of r is between 0.7 and 1, we say that the two variables have a strong linear
association. If the magnitude is between 0.3 and 0.7, the two variables have a moderate linear association.
If the magnitude is between 0 and 0.3, the two variables have a weak linear association. Do note that
other sources may differentiate strong/moderate/weak linear associations at other “cut-off” points that
are different from 0.3 and 0.7.
In general, as the value of r becomes closer to 1 or −1, the data points will increasingly fall more
closely to a straight line. Scatter plots where the data points are loosely dispersed typically mean that
correlation is weak (or non-existent). We will now discuss how to compute the value of r numerically.

Example 3.3.3 We will go through the steps required to compute the correlation coefficient using an
example. Consider the following table that shows a total of 10 data points of bivariate data (x, y):

x 9 4 5 10 6 3 7 2 8 1
y 41 17 28 50 39 26 30 6 4 10

1. First compute the mean and standard deviation of x and y. (Refer to Definition 1.4.1 and Definition
1.5.1 if you have forgotten how these are computed.) For this data set, we find the mean and
standard deviation of x to be 5.5 and 3.03 respectively while the mean and standard deviation of
y are 25.1 and 15.65 respectively.

2. Convert each value of x and y into standard units. To convert x (resp. y) into its standard unit,
we compute (x − x̄)/sx (resp. (y − ȳ)/sy),
where sx and sy are the standard deviations of x and y respectively. The table below shows the
values of x and y after they have been converted to standard units.

x 1.16 −0.50 −0.17 1.49 0.17 −0.83 0.50 −1.16 0.83 −1.49
y 1.02 −0.52 0.19 1.59 0.89 0.06 0.31 −1.22 −1.35 −0.96

3. Compute the product xy in their standard units for each data point. The table below has an
additional row for the value xy for each data point.

x 1.16 −0.50 −0.17 1.49 0.17 −0.83 0.50 −1.16 0.83 −1.49
y 1.02 −0.52 0.19 1.59 0.89 0.06 0.31 −1.22 −1.35 −0.96
xy 1.17 0.26 −0.03 2.36 0.15 −0.05 0.15 1.41 −1.11 1.43

4. Sum the products xy obtained in the previous step over all the data points and then divide the
sum by n − 1, where n is the number of data points. The result is the correlation coefficient r. For
the data set above,
r = (1/9) × (1.17 + 0.26 − 0.03 + 2.36 + 0.15 − 0.05 + 0.15 + 1.41 − 1.11 + 1.43) = 0.64.
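The four steps above can be carried out in a few lines of code. The sketch below reproduces the computation for the data set in this example, using the sample standard deviation (with n − 1 in the denominator) throughout.

```python
# A sketch of the step-by-step computation of r in Example 3.3.3,
# using the standard-units method described above.
import statistics

x = [9, 4, 5, 10, 6, 3, 7, 2, 8, 1]
y = [41, 17, 28, 50, 39, 26, 30, 6, 4, 10]
n = len(x)

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)

# Steps 2 and 3: convert each value to standard units, then multiply pairwise.
products = [((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)]

# Step 4: sum the products and divide by n - 1.
r = sum(products) / (n - 1)
print(round(r, 2))   # 0.64
```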

Remark 3.3.4 For the purpose of this module, you will not be required to compute r manually, instead
you should be familiar with the method of how r is computed and thereby develop some basic intuition
on the properties of r.

Example 3.3.5 Let us revisit Example 3.2.6, where the scatter plot of HDB resale flat prices against
the ages of the flat shown below does indeed suggest that these two variables are negatively associated.

Indeed, upon computing the correlation coefficient between these two variables, we find that r =
−0.356, confirming that there is moderate negative linear association between the age and resale price
of HDB flats from the period January to June 2021.
We will now present three properties of correlation coefficients.

1. From the “Age” vs. “Price” of HDB resale flats example, we saw that r = −0.36 when we consider
the scatter plot with Age as the x-axis and Resale price as the y-axis. What would happen to r if
we had done the plot with Resale price as the x-axis and Age as the y-axis? In other words, what
happens to r when we interchange the x and y variables? If we revisit the process that describes
how r is computed from a bivariate data set, you would realise that regardless of which variable is
x (or y), the computation of r would not be affected in any way.

The correlation coefficient r is not affected by interchanging the x and y variables.

2. What would happen to the value of r if we add a constant to all the values of a variable? For
example, suppose it was discovered that there was an error in the recording of all the resale prices
of HDB flats and that the actual resale prices were all $1000 higher than what was given in the
data set. To correct this error, we would have to add $1000 to all the resale prices in the data set.
It turns out that such a change does not affect the value of r.

The correlation coefficient r is not affected by adding a number to all values of a variable.

While this may not be immediately obvious, you are encouraged to verify this result by using the
data set in Example 3.3.3 and adding some number to all the values of x (or y).

3. Instead of adding the same number to all the values of a variable, what would happen to the value of r if we multiply all the values of a variable by a positive number instead? For example, what if the resale prices were converted to US dollars? This means that we have to multiply all the resale prices in the data set by a factor of 0.73 (assuming an exchange rate of 1 Singapore dollar to 0.73 US dollars). It turns out that such a change again does not affect the value of r.

The correlation coefficient r is not affected by multiplying all values of a variable by a positive number.

You are again encouraged to verify this result by adjusting the data set in Example 3.3.3 and
recalculating the correlation coefficient.
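The three properties above are easy to verify numerically. The following R sketch uses invented numbers (not the HDB data); any bivariate data set would do.

```r
# Invented data for checking the three properties of r.
x <- c(2, 5, 7, 8, 12, 15)
y <- c(1, 4, 3, 9, 11, 13)

cor(x, y)           # original correlation coefficient
cor(y, x)           # property 1: interchanging x and y leaves r unchanged
cor(x, y + 1000)    # property 2: adding a number to all values of y leaves r unchanged
cor(x, 0.73 * y)    # property 3: multiplying all values of y by a positive number leaves r unchanged
```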

While the correlation coefficient between two numerical variables is insightful, there are certain limi-
tations.

Discussion 3.3.6

1. Association is not causation. To confuse association with causation is a common mistake that
is made by many. Very often when there is a strong association between two variables, with a
correlation coefficient of r that is close to 1 or −1, it is mistakenly concluded that any change in
the explanatory variable, say x, will cause the response variable y to change. This is incorrect as
what we can conclude is only a statistical relationship between x and y and not a causal relationship.

Consider the example above of a scatter plot that came from a data set containing information
on the percentage of people that earned a Bachelor’s Degree in 2017 across 3142 counties in the
United States, as well as the per capita income of these counties in 2017.3 Each data point in the
scatter plot represents a county. The x-axis is the per capita income in the past 12 months while
the y-axis is the percentage of the population in the county that earned a Bachelor’s Degree in
2017. The correlation coefficient for the two variables is 0.79, which indicates that there is strong
and positive association between the two variables.
It would be tempting to conclude that the higher the per capita income of a county, the higher
the percentage of the county’s population would have earned a Bachelor’s Degree. This is not
necessarily true. The data here merely suggests association of the two variables and does not
establish any causal relationship.
2. r does not tell us anything about non-linear association. The correlation coefficient r,
as defined and described in this section, measures the degree of linear association between two
numerical variables. Whatever the computed value of r is, it does not give any indication of
whether the two variables could be associated in a non-linear way.

The correlation coefficients for the three scatter plots above are small but yet there is actually a
strong relationship between the variables. The value of r is small because the relationship between
the variables is not a linear one. It is always a good practice to look at a scatter plot of the data
set and not just deduce any relationship between the variables from the computed value of r.
3 Data set can be downloaded from www.openintro.org/data/?data=county complete.

3. Outliers can affect the correlation coefficient significantly. Outliers are observations that
lie far away from the overall bulk of the data. How do outliers affect the value of the correlation
coefficient? The removal of outliers from a data set can have different effects on the correlation
coefficient, depending on how the outlier is positioned in relation to the rest of the data points.

Consider the scatter plot on the left, where the outlier is circled. The correlation coefficient is 0.22 based on the data set that includes the outlier. However, when we remove the outlier, we see that there is a strong positive linear association between the remaining data points. Thus, in this case, the presence of the outlier decreases the strength of the correlation, compared to when the outlier is removed.
Consider the scatter plot on the right where again the outlier is circled. In this case, the correlation
coefficient is −0.75 based on the data set that includes the outlier. When the outlier is removed,
the remaining data points give a correlation coefficient of 0.01. Thus, in this case, the presence
of the outlier actually increases the strength of the correlation, compared to when the outlier is
removed.
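The effect of an outlier is also easy to demonstrate numerically. The sketch below uses invented numbers, not the data behind the scatter plots above: a single outlying point markedly weakens an otherwise strong positive linear association.

```r
# Invented data: eight points lying close to a straight line.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 3.0, 4.2, 4.9, 6.1, 7.0, 8.2, 8.8)
cor(x, y)                       # very close to 1

# Add one outlier far below the trend and recompute.
x_out <- c(x, 9)
y_out <- c(y, 0)
cor(x_out, y_out)               # noticeably smaller than before
```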

Example 3.3.7 For the HDB data set that we introduced earlier, the scatter plot below shows the
relationship between the resale price and the floor area of the flat. There are three outliers (circled) and
these are resale flats whose floor areas are larger than 200 square meters.

Using a statistical software, it was found that the correlation coefficient was 0.626 before the out-
liers were removed. After the outliers are removed, the correlation coefficient becomes 0.625, which is
practically the same as before.

Definition 3.3.8 So far, we have discussed correlation in the setting where individual data points are
considered. For example, the collection of data points could represent individuals from a population.
However, we can also examine the data at an aggregated level by grouping these individuals based
on factors like ethnic group or education level. An ecological correlation is computed based on the

aggregates rather than on the individuals. Thus, ecological correlation represents relationships observed
at the aggregate level, considering the characteristics of groups rather than individuals.

Example 3.3.9 Consider the scatter plot below for a data set consisting of individuals belonging to
three distinct groups. The three groups are represented by the symbols circle, cross and plus.

The correlation coefficient computed at the individual level is r = 0.85, indicating that there is a strong and positive linear association between the variables X and Y. Suppose we compute the group
averages (for X and Y ) for the three subgroups and obtain the three red dots as shown in the figure.
These three red dots, or aggregate points, align rather closely along a straight line. In fact, if we compute the correlation coefficient based on these three aggregate points, the correlation coefficient would be 0.98.
Consequently, this example illustrates that the ecological correlation derived from group averages
suggests a more pronounced (since 0.98 is closer to 1 than 0.85) positive linear association compared to
correlation calculated at the individual level.
This phenomenon does not happen all the time. In general, when the association for both individuals and aggregates is in the same direction, the ecological correlation based on aggregates will typically overstate the strength of the association among individuals. Without getting into details, the intuitive explanation for this is that the variability among individuals is no longer as significant when the correlation is computed based on group aggregates.
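A small invented example makes the point concrete: the correlation among group averages is at least as strong as, and here stronger than, the correlation among the individuals themselves. The grouping and numbers below are hypothetical.

```r
# Fifteen hypothetical individuals in three groups A, B and C.
group <- rep(c("A", "B", "C"), each = 5)
x <- c(1, 2, 3, 4, 5,   6, 7, 8, 9, 10,   11, 12, 13, 14, 15)
y <- c(3, 1, 4, 2, 5,   7, 9, 6, 8, 10,   12, 14, 11, 15, 13)

cor(x, y)                         # correlation at the individual level

# Ecological correlation: correlate the three group averages instead.
xbar <- tapply(x, group, mean)
ybar <- tapply(y, group, mean)
cor(xbar, ybar)                   # stronger than the individual-level correlation
```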

Definition 3.3.10 The previous example reminds us that correlation at the individual level and at the aggregate level may tell different stories about our data set. We need to be careful not to make
any wrong deductions. Consider the scatter plot below that represents the relationship between two
variables.

There are clearly four distinct subgroups of individuals (grouped by the four ovals). If we consider the
subgroup averages, represented by the four red dots in the diagram, the correlation between these four
subgroup averages suggests that there is a positive linear association, as indicated by the blue regression
line. Can we now conclude that at the individual level, there is also a positive linear association between
the two variables?
This is not the case. If we look at the individual level within each subgroup, we notice a weak, but nevertheless negative, linear association between the two variables. Thus, we would have been wrong if we drew conclusions about correlation at the individual level based on what we observe at the aggregate level. If we do so, we would have committed what is known as the ecological fallacy.
The moral of the story is that we should not assume that correlations based on aggregates will hold
true for individuals. Ecological correlation and correlation based on individuals are not the same and
should not be confused.

Definition 3.3.11 Consider another scatter plot below, again representing the relationship between
two variables X and Y .

Again, there are clearly three distinct subgroups of individuals in the data set and within each
subgroup, we observe a strong positive linear association between the two variables. Can we now conclude
that at the aggregate level, there is also a positive linear association between the two variables?
The three subgroup averages, represented by the three red dots, are shown. It turns out that there is actually no clear correlation between the variables at the aggregate level. Based on the correlation we observed at the individual level, if we had mistakenly concluded that the same correlation would exist at the aggregate level, we would have committed what is known as the atomistic fallacy.
Differentiating the two types of fallacies described above can be confusing initially. The following table summarises them.

Fallacy Using To conclude


Ecological Ecological correlation (aggregate level) Individual level correlation
Atomistic Individual level correlation Ecological correlation (aggregate level)

Section 3.4 Linear regression

Now that we have seen that the age of a HDB resale flat is negatively associated with the resale price, it is reasonable to wonder if we can make some predictions on the resale price of a flat given the age of the flat. For example, for a flat that is 40 years old, what is our guess for its resale price?

Definition 3.4.1 If we believe that two variables X and Y are linearly associated, we may model the
relationship between the two variables by fitting a straight line to the observed data. This approach is
known as linear regression. Recall that the equation of a straight line is given by

Y = mX + b,

where b is the y-intercept and m is the slope or gradient of the line. The y-intercept is the value of
Y when the value of X is 0. The slope of the line is the amount of change in Y when the value of X
increases by 1.

In the figure above, the straight line in red is the regression line that is fitted to the observed data,
represented by the blue dots. Consider the i-th observation (Xi , Yi ). The “?” in the figure represents the
residual of the i-th observation, which is the observed value of Y for Xi (that is, Yi ) minus the predicted
value of Y for Xi (predicted by the straight line). This residual, denoted by ei , is sometimes also called
the error of the i-th observation as it measures how far the predicted value is from the observed value.

Example 3.4.2 Let us return to the question we posed at the beginning of this section. What is our
prediction for the resale price of a HDB flat that is 40 years old?

With X representing the age of the resale flat and Y being the resale price, the regression line obtained
from the data set is
Y = −4007X + 591857.

This means that when X = 40, (age of resale flat is 40),

Y = −4007 × 40 + 591857 = 431577.

So the predicted resale price of a 40 year old flat is $431,577. It is important to note that we are not
concluding that

A 40 year old resale flat will be sold at $431,577.

But instead our linear regression model predicts that

The average resale price of 40 year old HDB flats is $431,577.

Furthermore, as the correlation between resale flat price and age of the flat is only moderate, the prediction obtained from the linear regression above may not be as accurate as it would be in a scenario where the correlation is stronger.
Now that we have seen how a regression line can be used, the question is how do we obtain such a
line given bivariate data? What method and principle is used to determine the regression line? Among
the many different straight lines that we can use to fit the data points, which one is the “best”?

Discussion 3.4.3 There are several ways to assess which straight line fits the observed data better. One of the most common ways is the method of least squares. For this module, we will not go into the technicalities of this method, but instead we will briefly describe the idea behind it.
Recall that when we fit a straight line through a set of observed data points (x_i, y_i), the difference between the observed value y_i and the predicted outcome, predicted by the straight line, is known as the residual of the i-th observation. This residual, denoted by e_i, is also known as the error of the i-th observation; it measures how far the observed value is from the predicted value.

In the plot above, we see that each data point gives rise to an error term and it is reasonable to say that a line of good fit is one that keeps the error terms (considered over all data points) small. However, instead of looking at the overall error by summing up

e_1 + e_2 + · · · + e_n,

where n is the total number of data points, the method of least squares seeks to find a straight line that minimises the overall sum of squares of errors,

e_1^2 + e_2^2 + · · · + e_n^2.

You may wonder why minimising e_1^2 + e_2^2 + · · · + e_n^2 is more appropriate than minimising e_1 + e_2 + · · · + e_n. We will leave you to ponder this question before having a discussion with your friends or instructor.
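In practice the least squares line is obtained with software. The R sketch below, using invented numbers, fits the line with lm() and checks that its sum of squared errors is smaller than that of a slightly different line.

```r
# Invented bivariate data.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7)

fit <- lm(y ~ x)                 # least squares regression line
coef(fit)                        # intercept b and slope m

sse_ls <- sum(residuals(fit)^2)  # sum of squared errors for the fitted line

# Any other line, e.g. same intercept but slope increased by 0.1, does worse.
b <- coef(fit)[1]
m <- coef(fit)[2]
sse_other <- sum((y - (b + (m + 0.1) * x))^2)

c(least_squares = sse_ls, other_line = sse_other)   # sse_ls is the smaller value
```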

Remark 3.4.4

1. It is useful to note that the least squares regression line obtained from a set of observed data points (x_i, y_i) will always pass through the point of averages for that data set, that is, (x̄, ȳ). This fact can be established mathematically, but is beyond the scope of this course.

2. It is important to note that while we have obtained the least squares regression line that allows us
to predict the average resale price for a given age of the resale flat, the same regression line cannot
be used to predict the average age of resale flats for a given resale price. The reason is essentially
because of the way the regression line was obtained.
In obtaining the regression line with the independent variable (x) as age and the dependent variable
(y) as the resale price, the line was fitted to minimise the square of error terms between the observed
and predicted resale prices.
If the intention was to use a given resale price to predict the average age of the resale flats, then
we would be looking at another regression line that minimises the square of error terms between
the observed and predicted ages of resale flats.
The two regression lines are different and thus not interchangeable.

3. The correlation coefficient r between the variables X and Y is closely related to the regression line

Y = mX + b

obtained using the method of least squares. More precisely, we have

m = r × (s_Y / s_X),

where s_X (resp. s_Y) is the standard deviation of X (resp. Y). With this relationship, we see that if the correlation coefficient r is positive, then the gradient of the regression line is also positive. Similarly, if the correlation coefficient is negative, then the gradient of the regression line will also be negative. However, it is important to remember that the correlation coefficient is not necessarily equal to the gradient of the regression line. (A quick numerical check of this relationship, and of the point-of-averages fact in point 1, is sketched at the end of this remark.)

4. Another important point about the linear regression line obtained from a data set concerns the range of the independent variable in the data set.

Recall that we have obtained the linear regression line for the purpose of predicting the average
resale price based on the age of the resale flat. From the data set, the value of the independent
variable (in this case, this is the age of the resale flat) ranges from 2 to 54 years. Thus the prediction
that can be arrived at using the regression line is only applicable for HDB flats whose age is between
2 and 54 years old. Outside this range, we should not use the regression line to make our prediction
as the best fit regression line may change outside this range. For example, we should not use the
regression line to predict the average resale price of flats that are 60 years old as our data set does
not contain any information on resale flats that are more than 54 years old.
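As promised in point 3, here is a small numerical check of two of the facts above, using the same invented data as in the earlier sketch: the fitted line passes through the point of averages, and its slope equals r × s_Y / s_X.

```r
# Invented data, reused only to check the facts stated in this remark.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7)
fit <- lm(y ~ x)

# Point 1: the line passes through (x-bar, y-bar).
predict(fit, newdata = data.frame(x = mean(x)))   # equals mean(y)
mean(y)

# Point 3: the slope equals r * s_Y / s_X.
coef(fit)[2]
cor(x, y) * sd(y) / sd(x)                          # same value as the slope
```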

Discussion 3.4.5 To conclude this section and also the chapter, we will describe a method to study
the relationship between two variables if the relationship is not linear. The following table shows part of
a data set that provides the total number of confirmed COVID-19 cases in South Africa since 5 March
2020.4
4 Data set can be downloaded from www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset.

t      Total confirmed cases
76     17200
77     18003
78     19137
79     20125
...    ...
95     48285
96     50879
...    ...

In this data set, t is the variable representing the number of days since 5 March 2020.
It can be computed using Microsoft Excel or other statistical software that the correlation coefficient
between the total number of confirmed cases and t is 0.812, which indicates that there is a strong
positive linear association between the two variables. Is this indeed the case? Perhaps we may make
such a conclusion but as stated earlier, correlation coefficient alone does not give the entire picture. We
should create a scatter plot using our bivariate data and verify if there is really a linear relationship.

Are the two variables associated linearly? It is quite clear visually that the total number of confirmed
cases increases exponentially when t increases. Thus, if we let y be the variable representing the total
number of confirmed cases, y and t are not linearly associated but instead the relationship between them
seems to be exponential. For such a situation, can we apply our linear regression technique to make
predictions on the total number of confirmed cases? The answer is yes, but it would have to be done
indirectly.
Now, if the relationship between y and t is indeed exponential in nature, we can model this relationship
using the equation
y = cb^t,
where c and b are some constants that we will determine. Using the property of the logarithmic function,
we see that

y = cb^t    is equivalent to    ln y = ln(cb^t)    is equivalent to    ln y = ln c + t ln b.

Thus, instead of making a scatter plot with y plotted against t, we will make a scatter plot with ln y
plotted against t. If there is indeed an exponential relationship between y and t, then we would expect
to see a linear relationship between ln y and t, as indicated by the equivalent equations above. Let us go
through the steps:

(a) Step 1: For each data point (t, y), compute (t, ln y). For our data set on COVID-19 cases in South
Africa, we have the following table:

t      Total confirmed cases (y)      ln(y)
76     17200                          9.753
77     18003                          9.798
78     19137                          9.859
79     20125                          9.910
...    ...                            ...
95     48285                          10.785
96     50879                          10.837
...    ...                            ...

We then plot ln y against t.

(b) Step 2: Find the linear regression line for ln y vs t. For our example, the regression line was found
to be
ln y = 4.287 + 0.066t.

This means that ln c = 4.287 and ln b = 0.066.

(c) Step 3: Since ln c = 4.287 and ln b = 0.066, we have

c = e^4.287 and b = e^0.066.

We are now able to write down the exponential equation relating y and t:

y = cb^t = e^4.287 × e^0.066t = e^(4.287+0.066t).
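The three steps above can be carried out with a few lines of code. The sketch below uses invented counts that grow roughly exponentially (not the South Africa data), so the fitted constants will differ from those in the example.

```r
# Invented counts growing roughly exponentially with t.
t <- 1:10
y <- c(12, 17, 22, 31, 44, 60, 85, 118, 165, 230)

fit  <- lm(log(y) ~ t)       # Steps 1 and 2: regress ln(y) on t
ln_c <- coef(fit)[1]         # estimate of ln(c)
ln_b <- coef(fit)[2]         # estimate of ln(b)

c_hat <- exp(ln_c)           # Step 3: recover c and b
b_hat <- exp(ln_b)

# Predict the total for a t-value within the observed range, say t = 9.
c_hat * b_hat^9
```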



Exercise 3
1. Outliers are observations that fall well above or below the overall bulk of the data. Consider a set of 50 (univariate) data points with a single outlier. Suppose the outlier is removed from the data set. Which of the following is/are always true? Select all that apply.

(A) The removal will cause the mean to decrease.


(B) The removal will cause the interquartile range to decrease.
(C) The removal will cause the standard deviation to decrease.
(D) The removal will cause the range to change.

2. The GEA1000 midterm results for the year 2050 Semester 1 are shown in the boxplot below. There
were 50 students who took the test, and the test scores are out of 100. No outliers were removed.

Which of the following can be derived from the boxplot? Select all that apply.

(A) There is at least one outlier.


(B) The range is 40.
(C) The interquartile range is 40.
(D) The standard deviation is 14.

3. Suppose that there are 76 pairs of siblings living in a particular block in Ang Sua, where the
older sibling is always heavier than the younger sibling. Consider a scatter plot using the younger
sibling’s weight to predict the older sibling’s weight, where each point in the scatter plot represents
the weights of a pair of two siblings in the block. Which of the following statements must be true?

(I) There is a positive association between the older and younger siblings’ weights.
(II) All the points lie above the line y = x in the scatter plot.

(A) Only (I).


(B) Only (II).

(C) Neither (I) nor (II).


(D) Both (I) and (II).

4. Consider data sets A, B and C, each consisting of 10,000 numbers with mean 5. The histograms
for A, B and C are shown below.

Order the data sets according to the values of their standard deviations, from the smallest to the
largest.

(A) A, B, C.
(B) A, C, B.
(C) B, A, C.
(D) B, C, A.

5. The five-number summary for a numerical variable X with 77 values is given as 57, 68, 70, 72, 77.
Define Y = 10 − 2X. What is the IQR of Y ?

(A) −8.
(B) −2.
(C) 4.
(D) 8.

6. The boxplot below shows the distribution of the marks of 30 students.



Which of the following statements must be true?

(I) There is only one student who scored higher than 23.5 marks.
(II) The range of the marks of the 30 students is 17.5.

(A) Only (I).


(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

7. Professor X conducted a test for his class of 16 students, and tabulated the following five-number
summary for the test scores:

Minimum Q1 Median Q3 Maximum


41.20 45.00 50.75 54.12 58.90

Two days later, he discovered, to his horror, that he had made a mistake in the computation of
the test scores, and everyone should get 10 marks more.
The new (and correct) median score is (1) and the IQR is (2) .
Fill in the blanks for the statement above, giving your answers correct to 2 decimal places.
8. Consider the following data set, which we will refer to as set A:
{15, 23, 13, 17, 8, 42, 4, 37, 12, 16}.
A student decided to do a check for outliers, after which such value(s) was/were removed. Let us
designate the set of remaining data points as set B. Which of the following statements is/are true?
Select all that apply.

(A) The range of B is 19.


(B) The median of B is lower than the median of A.
(C) The median of B is greater than the mean of B.
(D) The median of B is lower than the mean of A.

9. The following histogram is constructed using 100 observations of a discrete numerical variable X.
For the first bin, [0, 1], both 0 and 1 are included in the bin. For every other bin, the left endpoint
is excluded while the right endpoint is included.

Based on the histogram, which of the following statements is/are definitely true? Select all that apply.

(A) The distribution is right-skewed.


(B) The maximum value is 8.
(C) The value that occurs the most often in this data set is in the bin (1, 2].
(D) Only a quarter of the observations is larger than 3.

10. The following two diagrams are adapted from a paper published on Nature titled “Irregular
sleep/wake patterns are associated with poorer academic performance and delayed circadian and
sleep/wake timing”, which studied a group of students. The diagrams describe the association
between the variables Grade Point Average (GPA), Sleep Regularity Index (SRI) and Actual Dim
Light Salivary Onset (Actual DLMO). Actual DLMO was not recorded for participants that have
neither regular nor irregular sleep/wake patterns.

Based only on the two diagrams above, which of the following is necessarily true?

(A) If the researchers collected information about the average household income amongst the
participants and found a positive association between average household income and Grade
Point Average, then they may conclude that average household income and SRI are positively
associated amongst the participants.
(B) The predicted Actual DLMO value for a student who has neither regular nor irregular sleep/wake
patterns is less than 24.
(C) A higher SRI value is associated with a lower Actual DLMO value for students who have
regular sleep/wake patterns.
(D) Given a student’s GPA, the researchers should use the equation of the regression line of GPA
against SRI to predict SRI.

11. You’ve been helping a friend generate some nice-looking figures in Radiant. Unfortunately, you
lost track of which data sets were being used for which histograms and boxplots. You don’t want
to make them all over again (though it would be trivial if you saved your script). Which boxplot
(1-4) goes with which histogram (A-D)?

(A) 1C, 2A, 3D, 4B


(B) 1C, 2B, 3D, 4A
(C) 1D, 2A, 3C, 4B
(D) 1B, 2C, 3D, 4A

12. Suppose that the following are 10 data points for a numerical variable X:

16, 82, 72, 100, r, 22, 83, 62, −2, 99,

where r is an unknown whole number less than 72. It is known that there is only one outlier in
this data set. An outlier is defined as a data point having a value greater than Q3 + 1.5*IQR or
less than Q1 – 1.5*IQR. What is the maximum possible value of r?

13. 3000 students took a multiple-choice quiz in school. The quiz consisted of 10 questions. Each
student answered all 10 questions. For each question answered correctly, 1 mark was awarded, and
for each question answered wrongly, no marks were awarded. There was no partial credit awarded.
The average score was 5, and the standard deviation of the scores was 2.
The number of correct answers and wrong answers for each student was plotted in a scatter plot,
with the number of correct answers represented on the horizontal axis and the number of wrong
answers on the vertical axis.
The correlation coefficient between the number of correct answers and number of wrong answers
is:

(A) 1.
(B) −1.
(C) 0.
(D) Unable to tell from the information provided.

14. Of the five values below, which would be that of a correlation coefficient with the strongest corre-
lation?

(A) −1.4.
(B) −0.9.
(C) 0.3.
(D) 0.7.

15. What will happen to the correlation coefficient between X and Y if a point with coordinates
(80, 110) is added to the scatter plot shown below?

(A) It will increase.


(B) It will decrease.
(C) It will remain the same.

16. A system for marking students’ R computer programs, called markeR, has been used successfully
at a university. markeR takes into account both program correctness and program style when
marking students’ assignments.
To evaluate its effectiveness, markeR was used to grade the R assignments of a class of 40 students.
These scores, which range from 10.5 to 19, were then compared to the scores given by the instructor
of the class. The results are summarised below.

Variable Sample mean Sample standard deviation


markeR score (x) 16.5 1.5
Instructor score (y) 14.5 2.25

The sample correlation between y and x is 0.85. A least squares regression line is used to predict
the average instructor score from the markeR score. We are given that the regression line passes
through the point (16.5, 14.5).
(Fill in the blank.) When the markeR score is 15, the predicted average instructor score is
(rounded to 2 decimal places.)

17. Below is a scatter plot showing preliminary exam and final exam scores for students in a secondary
school along with the linear regression line.

The average scores for the preliminary exam and final exam were both 60, with standard deviations
of 5.1 and 6.6 respectively. What does the slope of 0.98 of the linear regression line predict?

(A) The increase in average final exam scores, corresponding to an increase of 1 mark in the
preliminary exam.
(B) The correlation between the final and preliminary exam scores.
(C) The average final exam score of students who scored 0 on the preliminary exam.
(D) None of the other options.

18. The scatter plot below shows the relationship between height and shoulder girth (circumference of
shoulders measured over deltoid muscles).

The equation of the regression line for height vs shoulder girth is given by y = 0.6x + 106, where y
refers to the height and x refers to shoulder girth. Which of the following statements below is/are
correct? Select all that apply.

(A) If we were to predict average shoulder girth from height using simple linear regression, the
gradient of the regression line is also positive.
(B) Using simple linear regression, when the shoulder girth is equal to 141cm, the predicted average
height is 190.6cm.
(C) Using simple linear regression, when the height of the individual is 170cm, the predicted
average shoulder girth is 106.67cm.
(D) If the shoulder girth of all individuals above are 2cm shorter, then the gradient of the regression
line for height vs shoulder girth is 0.6.

19. A researcher examined the relationship between variables X and Y among 250 male and female
subjects. He graphed the relationship in the scatter plot shown below. Let r be the correlation
coefficient for all 250 subjects, r1 be the correlation coefficient among male subjects only and r2
be the correlation coefficient among female subjects only.

Which of the following correctly describes the relationship between r, r1 and r2 ?

(A) r1 < r < r2 .


(B) r1 > r > r2 .
(C) r > r1 > r2 .
(D) r < r1 < r2 .

20. A researcher is interested in the correlation between the amount of time an individual spends on
social media and the individual’s level of happiness. Suppose that she observed that the correlation
coefficient r1 for males only is 0.8, and that the correlation coefficient r2 for females only is also
0.8. Which of the following statements must be true for r, the correlation coefficient when the data
for males and females are combined?

(A) 0 ≤ r ≤ 0.8.
(B) r = 0.8.
(C) 0.8 < r ≤ 1.
(D) None of the other given options is correct.

21. Based on the scatter plot shown below, which of the following is closest to the equation for the
regression line? Here, W is the weight of the car and C is the consumption.

(A) W = 3 − 0.1C.
(B) W = 5 − 0.1C.
(C) W = 3 + 0.8C.
(D) W = 5 + 0.8C.

22. Which of the following is/are true about a non-zero correlation coefficient? Select all that apply.

(A) The correlation coefficient does not change when we add 5 to all the values of one variable.
(B) The correlation coefficient is positive when the slope of the regression line is positive.
(C) The correlation coefficient does not change when we multiply all the values of one variable by
2.
(D) A correlation of −0.3 is stronger than a correlation of −0.8.

23. The relationship between the number of glasses of beer consumed daily (x) and blood alcohol
content in percentage (y) was studied in young adults. The equation of the regression line is
y = −0.015 + 0.02x for 1 ≤ x ≤ 10. The legal limit to drive in Singapore is having a blood alcohol
content below 0.08%. Des, a young adult, had just finished 5 glasses of beer. After that, he wanted
to take his car out for a drive. Is it legal for him to drive in Singapore?

(A) Yes.
(B) No.
(C) Unable to determine.

24. Three father-son pairs had their heights measured. The following table shows their heights:

Pair Father (inches) Son (inches)


A 68 72
B 70 71
C 66 70

Using these three data points, the standard deviation for the fathers would be 2 and for the sons
it would be 1. From the table, what is the standard unit for the son from pair A?

(A) −1.
(B) 0.
(C) 1.
(D) 1.88.

25. Suppose that there are 40 male students in a class and each student scored 5 less marks for his
maths test than what he scored for his science test. What can we say about their maths and science
test marks? Select all that apply.

(A) The interquartile range of science test marks is higher than that for maths test marks.
(B) If student A scored a higher mark for the maths test than student B, then he must have scored
a higher mark than student B for the science test.
(C) The science test marks and maths test marks are perfectly negatively correlated.
(D) The standard deviation of maths test marks is equal to that of science test marks.

26. The regression line for Y vs X is given by Y = 0.82X + 59.1. The standard deviations for X and
Y are 1.5 and 2.2 respectively. Suppose now we construct a regression line that uses Y to predict
X.
The predicted average increase of X when Y is increased by 1 unit is . (Give your
answer correct to 2 decimal places.)

27. A professor wants to know the percentage of right-handed students in NUS. Since he is teaching
a course in NUS this semester, he decides to do a survey in his class. From the single survey, he
concluded that eighty percent of students in NUS are right-handed. Which one of the following
fallacies was committed by the professor?

(A) Atomistic fallacy.


(B) Ecological fallacy.
(C) None of the other options.

28. The total number of people who are infected by a disease (denoted by y) can be predicted using the regression model y = 2^(x+1) − 1, where x is the number of days from the first infection, up till the 30th day. Based on the information above, which of the following is true?

(A) After 3 days from the first infection, there will be exactly 15 people infected.
(B) If there were 7 people infected, it means that exactly 2 days have passed from the first infection.
(C) After exactly 20 days, there will be approximately less than 2 million people infected.
(D) The relationship can be modelled as a simple linear regression Y = mX + c, where Y = y, X = 2^x, m = 2, and c = −1.

29. Bivariate numerical data can be represented in the form (x, y). Which of these 4 data sets, af-
ter having added an additional data point (2, 8), would have the magnitude of their correlation
coefficient decrease as a result? Select all that apply.

(A) (2, 2), (8, 2), (8, 8)


(B) (2, 2), (4, 5), (6, 2)
(C) (2, 2), (5, 5), (8, 8)
(D) (2, 8), (5, 5), (8, 2)

30. "The relation between anxiety and BMI - is it all in our curves?" was published in the journal Psychiatry Research in 2016. As stated in the abstract of that research paper, "The relation between anxiety and excessive weight is unclear. The aims of the present study were three-fold: First, we examined the association between anxiety and Body Mass Index (BMI). Second, we examined this association separately for female and male participants..."
The first result reported was: No linear correlation between anxiety scores and BMI among all the participants was observed. If the researchers had not proceeded to investigate the association between anxiety scores and BMI separately for female and male participants, but concluded straightaway from their first result that "there is no linear correlation between anxiety scores and BMI among the females and among the males separately", what mistake would they have committed?

(A) Ecological fallacy


(B) Atomistic fallacy
(C) Confusing correlation and causation
(D) None of the other options is correct
Chapter 4

Statistical Inference

Section 4.1 Probability

In Chapter 1, we introduced the following types of research questions that are of interest.

1. To make an estimate about the population.

2. To test a claim about the population.

3. To compare two sub-populations / to investigate a relationship between two variables in the pop-
ulation.

You would have noticed that a common term that recurs in the three research questions is the word
population. Indeed these are all questions pertaining to the population. In order to answer these
questions, we would need to have complete information about the entire population. This is usually not
possible due to the sheer size of the population.
In order to give an approximate answer to the research questions, we need to use a sample of the
population. The process of drawing conclusions about the population from sample data is known as
statistical inference.

Recall the PPDAC cycle, first introduced in Chapter 1. In particular, when we focus on the second
to fourth phases of the cycle, “Plan”, “Data” and “Analysis”, these phases involve specialised tools
and techniques. These tools and techniques lead us to take a closer look at how these three phases are
inter-related.

The “Plan” and “Data” phases were discussed in Chapter 1. How is a sample obtained from a
population? What are the different methods of sampling and what are the types of biases we need to
avoid? From summary statistics introduced in Chapter 1, to how categorical variables can be analysed
in Chapter 2 and likewise for numerical variables in Chapter 3, these are all tools that we can use under
the Analysis phase.
To conclude the “Analysis” phase, we need to look at the results of our analysis of the sample and
subsequently draw conclusions on the population. This is where statistical inference comes into the
picture. In order to have a meaningful discussion on statistical inference, we need to acquire some
knowledge about probability. Probability and inference will form the main thrust of this chapter. To
begin, we will introduce some basic results in probability which allow us to discuss the tools required in
statistical inference. The two kinds of tools most common in statistical inference are confidence intervals
and hypothesis tests, both of which will be discussed in some detail in the rest of the chapter.
Let us lay the groundwork for probability by defining some basic terms that are necessary in this
subject.

Discussion 4.1.1 In previous chapters, we have occasionally touched on the notion of uncertainty.
When we use the word “chance” it is understood intuitively that something is not definite, or not certain
to hold or occur. In order to compare the likelihood of occurrence, we use terms like “more likely” or
“less likely”. These terms are common and adequate for everyday use. However, they are not precise
and as we deal with data at a deeper level, we need a rigorous framework to ground the concept of
uncertainty. Probability is a mathematical means that we can use to reason about uncertainty.
Consider a coin, with one side called “heads”, and the other called “tails”. Let’s say the coin is tossed
twice and the side that faces up when the coin lands is observed for both tosses. What are the possible
outcomes after two tosses? If we represent observing heads as H and observing tails as T , then the four
possible outcomes are:
HH, TT, HT, TH.

Here HT is differentiated from TH as HT means heads was observed in the first toss and tails in the second toss, while TH means tails was observed first, followed by heads. In this example, the procedure of tossing the coin twice is called a probability experiment. The set {HH, TT, HT, TH} contains the outcomes of the probability experiment.
It should be noted that the probability experiment defined here is narrower than the type of exper-
iments described in Chapter 1. A probability experiment must be repeatable and allows for the exact
listing of all the possible outcomes, like the way we have listed down all the four possible outcomes of
the probability experiment of tossing a coin twice.
A sample space is the collection of all possible outcomes of a probability experiment. A sub-collection
of the sample space is called an event. Referring back to our coin tossing probability experiment:

1. The sample space, as indicated earlier, is:

HH, TT, HT, TH.

2. An event could be
HH, TT.
We can describe this event as two in a row or two identical observations.

3. Another event could be
TT, TH, HT.
This event can be described as at least one tail.

4. Another event could be
HT.
We can describe this event as first toss is heads, second toss is tails.
Note that an outcome can also be considered an event but not all events are outcomes!

Having understood a probability experiment, the sample space of the experiment and an event of the
sample space, we are now ready to give context to the mathematical discussion of probability.

For a probability experiment with an associated sample space, the probability of an event of the sample space is the total probability that the outcome of the experiment is an element of the event.

For example, in our coin tossing experiment, the probability of at least one tail is how likely the outcome of the experiment will be TT, TH or HT.

Example 4.1.2 Another common example of a probability experiment is the rolling of a six-sided die.
It is obvious that such an experiment can be repeated as many times as we wish to and it is also easy to
list down all the outcomes of a single roll of the die.

1. Probability experiment: rolling (once) a six-sided die and observe the top facing side.

2. Sample space:
1, 2, 3, 4, 5, 6.
Rather than to use a list, we can put all the outcomes into a set,

{1, 2, 3, 4, 5, 6}.

3. An example of an event:
2, 4, 6.
This event can be described as “die shows an even number”. As an event is a sub-collection of
the sample space, when we represent the sample space as a set, an event will be a subset of the
sample space, for example
{2, 4, 6}.

As an exercise, you may wish to write down the sample space (as a set) for the probability experiment
of rolling a six-sided die twice and observing the top facing side on both rolls. Follow this by describing
the event and writing down the subset of the sample space that represents the event.

Notation 4.1.3 Suppose E is an event, then P (E) is the probability of event E. Probabilities are
numerical values between 0 and 1 (both inclusive), so P (E) takes on a numerical value between 0 and 1
and this is the probability assigned to event E.

Discussion 4.1.4 The question now is how do we know which numerical value between 0 and 1 to
assign to an event E? In other words, how do we know the probability P (E)? Mathematically, we
can define P (E) as the long run proportion of observing E when a large number of repetitions of the
experiment is being performed. Thus, we can repeat the probability experiment a large number of times
(say N times) and each time we observe if the outcome is an element of the event E. Suppose the first
experiment’s outcome is in E, then we mark that experiment with a “YES”. Repeat the experiment
again, suppose now the outcome is not in E, then we mark this experiment with a “NO”. Continue this,
till we have done the experiment N times and each time we have either a “YES” or a “NO”.
We now count the number of “YES” we have, out of the N times the experiment was done. Then
the probability of event E, P (E) is estimated by

(number of “YES”) / N.
It should be noted that

1. The estimate of P (E) we obtain from these N repetitions of the experiment is likely to be
different if we repeat the experiment (and get another estimate) another N times.

2. Such estimates get more accurate and closer to the true value of P (E) as the number N becomes
larger.

Example 4.1.5 Let us return to our die rolling experiment.

1. Probability experiment: rolling (once) a six-sided die and observe the top facing side.

2. Sample space: {1, 2, 3, 4, 5, 6}.

3. Event E = die shows an even number, that is E = {2, 4, 6}.

What would be an estimate of P (E)? Suppose we repeatedly roll (and observe) the die 500 times and
recorded the “YES” and “NO” as follows:

Roll                      1     2     3    · · ·   499   500
Outcome                   2     3     1    · · ·     6     2
Outcome belongs to E?    YES    NO    NO   · · ·   YES   YES

Suppose the total number of “YES”, out of the N = 500 times the experiment was carried out, is
268, then an estimate of P (E) is
P(E) = 268/500 = 0.536.
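The long-run proportion interpretation is easy to simulate. The R sketch below repeats the fair-die experiment N = 500 times; the exact number of “YES” outcomes will vary from run to run, but the resulting estimate of P(E) will be close to 0.5.

```r
set.seed(2024)                                    # any seed, just to make the run reproducible
N <- 500
rolls  <- sample(1:6, size = N, replace = TRUE)   # N rolls of a fair six-sided die
is_yes <- rolls %in% c(2, 4, 6)                   # does each outcome belong to E = {2, 4, 6}?
sum(is_yes) / N                                   # estimated P(E), close to 0.5
```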

Rules of Probabilities
It is virtually impossible to verify what the true probability for an event of a probability experiment is. For example, even if we say that a coin is “fair”, does it mean that the probability of “heads” is exactly 0.5 and that of “tails” is exactly 0.5? Probably (pun intended) not! The probabilities that we encounter in everyday life are just estimates of the true probability, but in the analysis of data, it is sufficient to treat the estimates as if they were the true probabilities. What is important and relevant in the use of such estimates is that in the assignment of probabilities to events of a probability experiment, the following rules of probabilities must be obeyed.

1. The probability of each event E, denoted by P (E) is a number between 0 and 1 (inclusive).

2. If we denote the entire sample space by S, then the probability of S, P (S) is 1.

3. If E and F are mutually exclusive events (meaning both events cannot occur simultaneously),
then the probability of E union F is equal to the sum of the probabilities of E and F . That is,
P (E ∪ F ) = P (E) + P (F ).

When the sample space contains only a finite number of outcomes, we only need to assign probabilities
to the outcomes so that these probabilities sum up to 1. The probabilities of all other events can then
be derived from there.

Example 4.1.6 Suppose we have a biased six-sided die being rolled once. The following probabilities
are assigned to the six possible outcomes.

Outcome 1 2 3 4 5 6
Probability 0.1 0.1 0.1 0.1 0.1 0.5
Check that the probabilities add up to 1. We are now able to derive the probabilities of certain events by
applying the third rule of probability as stated above. For example, if E is the event “an odd-numbered
face” and F is the event “an even-numbered face”, it is easy to see that
1. P (E) is the sum of P (1), P (3) and P (5), so P (E) = 0.3. (Here “1”, “3”, “5” are mutually exclusive
events.)

2. P (F ) is the sum of P (2), P (4) and P (6), so P (F ) = 0.7. (Here “2”, “4”, “6” are mutually exclusive
events.)

3. E and F are mutually exclusive events, so P (E ∪ F ) = P (E) + P (F ) = 0.3 + 0.7 = 1.

Definition 4.1.7 Uniform probability is the way of assigning probabilities to outcomes such that equal
probability is assigned to every outcome in the finite sample space. Thus, if the sample space contains
a total of N different outcomes, then the probability assigned to each outcome is
1/N.
As an example, if the sample space S contains the outcomes of flipping a fair coin twice, then

S = {HH, HT, TH, TT}.
Using uniform probability, we will assign the probability of 1/4 or 0.25 to each of the four outcomes.

Example 4.1.8 We have in fact seen uniform probability in action much earlier in Chapter 1 when
simple random sampling was introduced. Recall that in simple random sampling, r units from the
sampling frame are randomly selected to be included in the sample. We are conducting a probability
experiment where the sample space is the sampling frame that contains all the units that could possibly
be selected. The probability of selecting a particular unit at the first draw from the sampling frame is
thus 1/N, where N is the size of the sampling frame.
Furthermore, for any subset of the sample space (an event) denoted by A, the probability of this
event, P (A) is interpreted as the likelihood of selecting a unit belonging to A into the sample. This is
equal to the rate of A in the sampling frame.
As a concrete example, suppose the sampling frame consists of 500 adults, comprising 280 males and 220 females. By simple random sampling, each adult, for example, John, has a probability of 1/500 of being selected at the first draw. So the probability of a person selected at the first draw being male would be 280/500.

If we define A to be the subset of the sample space consisting of the male adults, then the probability
of event A is the rate of A in the sampling frame. That is,

P(A) = rate(male) = 280/500 = 0.56.
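This correspondence between probabilities and rates can also be illustrated by simulation: repeating the first draw many times from a hypothetical sampling frame of 280 males and 220 females gives a proportion of males close to 0.56.

```r
set.seed(1)
frame <- c(rep("male", 280), rep("female", 220))        # hypothetical sampling frame of 500 adults
draws <- sample(frame, size = 10000, replace = TRUE)    # repeat the "first draw" 10,000 times
mean(draws == "male")                                   # close to 280/500 = 0.56
```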

Discussion 4.1.9 (Sally Clark’s story) The story of Sally Clark’s trial sets the stage for our further
discussion on probability. In 1998, Sally Clark, a British woman, was accused of having killed her first child at 11 weeks of age and then her second child at 8 weeks of age. Sir Roy Meadow, who was a professor and consultant paediatrician, appeared as an expert witness for the prosecution.
In his testimony, Sir Meadow commented that the chance of two children from an affluent family
dying from Sudden Infant Death Syndrome (SIDS) is 1 in 73 million.
The jury eventually convicted Sally Clark by a 10-2 majority and she was given the mandatory
sentence of life imprisonment. Sensationally, it was later discovered that the probability that Sir Meadow
calculated was misrepresented. In 2001, the Royal Statistical Society (RSS) decided to issue a public
statement expressing its concern at the misuse of statistics in the courts. In the statement, the RSS
stated that there was no statistical basis for the figure of 1 in 73 million quoted by Sir Meadow. A year
later, in 2002, the Society wrote to the Lord Chancellor pointing out that the calculation leading to the
figure of 1 in 73 million was erroneous.
So what went wrong with Sir Meadow’s computation? To discuss this further, we will need to learn
two concepts of probability namely, conditional probability and independence. They will be covered in
the next section and we will continue with the story of Sally Clark subsequently.

Section 4.2 Conditional Probability and Independence

Let us begin this section by using the same example as Example 4.1.8. Suppose we have 500 adults, comprising 280 males and 220 females, as participants in a lucky draw where there is only one prize to be won. Under uniform probability, each person, for example, John, has a probability of 1/500 of being the winner of the prize.
Now suppose it is known that the winning ticket will be drawn from the male participants, what is
John’s probability of being the winner of the prize now?

Definition 4.2.1 The scenario described above involves the concept of conditional probability. Condi-
tional probability is normally written using the notation

P (E | F )

and is read as “probability of E given F ”. Here, E and F are events of a particular sample space. With
reference to our lucky draw example above, events E and F are:
E: Winner of the prize is John;
F : Winner of the prize is a male.
So the conditional probability P (E | F ) is the probability that John is the winner given that it
is a male who won. Intuitively, the probability of E given F measures how likely the outcome of the
probability experiment is an element of E, if we already know that it is an element of F . To
compute conditional probabilities, we use the idea of restricting the sample space based on the condition
that event F is known to have occurred.

More precisely, to compute the probability of E given F , we restrict our focus on the given event F
as our restricted sample space (rather than to look at the entire sample space). The event F may or
may not contain overlap with event E. The overlap is denoted by E ∩ F , to be read “E intersect F ”.
The probability of E given F is obtained by dividing the probability P (E ∩ F ) by the probability P (F )
which acts as the baseline (restricted sample space). Thus,

P(E | F) = P(E ∩ F) / P(F).

Remark 4.2.2

1. It is perfectly possible that there is no overlap between events E and F , meaning that it is not
possible that E and F happen simultaneously. In such a situation, it is clear that the probability
that event E occurs given that event F is known to have occurred is certainly 0. Indeed, with
P (E ∩ F ) = 0, we see that
P(E | F) = P(E ∩ F) / P(F) = 0.

2. If event F itself cannot occur, that is P (F ) = 0, then we will stipulate by convention that P (E | F )
is also equal to 0.

Example 4.2.3 In our lucky draw example, we have

P(F) = probability that the winner is a male
     = 280/500.

P(E ∩ F) = probability that the winner is John and the winner is a male
         = probability that the winner is John
         = 1/500.

P(E | F) = probability that the winner is John given that the winner is a male
         = P(E ∩ F) / P(F)
         = (1/500) × (500/280)
         = 1/280.
So the conditional probability that John is the winner, given that the winner is a male is

1/280 = 1 / (number of males in total).
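The lucky-draw calculation is just arithmetic with the conditional probability formula; a direct check is sketched below, with the counts taken from the example.

```r
p_F       <- 280 / 500           # P(F): the winner is a male
p_E_and_F <- 1 / 500             # P(E and F): the winner is John (who is male)

p_E_given_F <- p_E_and_F / p_F   # P(E | F) = P(E and F) / P(F)
p_E_given_F                      # 1/280, about 0.00357
1 / 280
```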

Discussion 4.2.4 (Conditional probabilities as rates) We have seen earlier that uniform probabilities correspond to the probability experiment of randomly selecting a unit from a fixed sampling frame. The table below draws the analogy between the two interpretations.

Random sampling Corresponds to Probability experiment


Sampling frame Corresponds to Sample space
A subgroup A of the sampling frame Corresponds to An event A of the sample space
The rate of A, rate(A) Corresponds to The probability of A, P (A)

What about for conditional probabilities? Will there be a similar correspondence to conditional rates?
More specifically, is the conditional probability of A given B equal to the rate of A given B whenever A
and B are subgroups of the sampling frame? The following derivation shows that they are indeed equal.

P(A | B) = P(A ∩ B) / P(B)          (by using the idea of restricted sample space)
         = rate(A ∩ B) / rate(B)    (by the correspondence between probabilities and rates)
         = [size of (A ∩ B) / size of sampling frame] ÷ [size of B / size of sampling frame]
                                     (by the definition of rates as ratios of two sizes)
         = size of (A ∩ B) / size of B
         = rate(A | B)              (by the definition of rates as ratios of two sizes)

So indeed, this derivation shows that just as probabilities are equivalent to rates in this probability
experiment, conditional probabilities are also equivalent to conditional rates.
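The equality of conditional probabilities and conditional rates can be checked on a small hypothetical sampling frame; the counts below are invented.

```r
# Hypothetical counts: 200 units in the sampling frame, 80 in B, 30 in both A and B.
n_frame   <- 200
n_B       <- 80
n_A_and_B <- 30

rate_A_given_B <- n_A_and_B / n_B          # rate(A | B)

p_B         <- n_B / n_frame               # P(B) under random selection
p_A_and_B   <- n_A_and_B / n_frame         # P(A and B)
p_A_given_B <- p_A_and_B / p_B             # P(A | B)

c(rate_A_given_B, p_A_given_B)             # both equal 0.375
```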

Discussion 4.2.5 Let us continue from Discussion 4.1.9 with the story of Sally Clark’s trial. So where
did Sir Meadow make the mistake in his calculation? It turns out that the error was due to misconception
of the conditional probability. The calculation done by Sir Meadow is similar to the following.

P(Evidence as observed | Clark is innocent)
  = P(Two infant deaths in the same family | Clark is innocent)
  = P(First infant death | Clark is innocent) × P(Second infant death | Clark is innocent)
  = (1/8543) × (1/8543)
  = 1/72,982,849.
The computed probability, which was really small, was what prompted Sir Meadow to claim that what
was observed had a 1 in 73 million chance of happening. However, note that the probability computed
above was
P (Evidence as observed | Clark is innocent),
which was mistaken to be

P ( Clark is innocent | Evidence is observed),

which was what the prosecutor was really meant to impress on the jury. The intention was to illustrate
what is the chance that Clark is innocent given the evidence as observed.
For two events A and B, the mistake of confusing P (A | B) as P (B | A) is known as the prosecutor’s
fallacy. In general, we note that P (A | B) is not equal to P (B | A). To see why this is so, we use the
definition of conditional probability as described in Definition 4.2.1.

P(A | B) = P(A ∩ B) / P(B)
P(B | A) = P(B ∩ A) / P(A)

It is now easy to see that for P (A | B) = P (B | A), we require either P (A) = P (B) or P (A ∩ B) = 0.
This is not always the case and therefore the two conditional probabilities are not necessarily equal.
The story of Sally Clark does not end here. Other than the confusion with conditional probabil-
ities, Sir Meadow’s calculation was also erroneous on another count, one that involves the concept of
independence which is what we will be discussing next.

Definition 4.2.6 When we say that two events A and B are independent, it means that
P (A) = P (A | B),
that is, the probability of A is the same as the probability of A given B. So, the fact that event B has
occurred does not affect the probability of A occurring. Now, if we express the conditional probability
P (A | B) as
P(A | B) = P(A ∩ B) / P(B),
then A and B being independent means that
P(A) = P(A ∩ B) / P(B),   which implies   P(A) × P(B) = P(A ∩ B).
P (B)
We thus have an equivalent definition of what it means for two events to be independent.

(Independence as non-association) Consider studying a population along two categorical vari-


ables, one with categories “A” and “Not A” and the other variable with categories “B” and “Not B”.
Such studies have been discussed in Chapter 2 and one of the main questions concern whether there is
association between the two categorical variables. We used a 2 × 2 contingency table similar to the one
below, to compute and compare the conditional rates.
          B       Not B
A
Not A
Recall that by the basic rule on rates, we say that the two variables are not associated if
rate(A) = rate(A | B).
Since we have drawn the correspondence between rates and probabilities, and conditional rates as con-
ditional probabilities, this leads us to conclude that A and B are independent events whenever A and B
are not associated with each other.
Here, the relevant probability experiment that draws the correspondence is one that involves randomly
selecting one unit from the population we are studying and checking the values (presence or absence) of
the selected unit with regards to the two categorical variables (with/without A and with/without B).
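To make the correspondence concrete, here is a small illustrative sketch in Python; the counts in the 2 × 2 table are made up for illustration and are not from the notes. It fills in a hypothetical table and checks whether rate(A) = rate(A | B), which is exactly the check for non-association, and hence for independence.

# Hypothetical counts for a 2 x 2 contingency table (illustration only)
#              B     Not B
# A          200       300
# Not A      200       300
a_b, a_notb = 200, 300
nota_b, nota_notb = 200, 300

total = a_b + a_notb + nota_b + nota_notb
rate_a = (a_b + a_notb) / total          # rate(A), i.e. P(A)
rate_a_given_b = a_b / (a_b + nota_b)    # rate(A | B), i.e. P(A | B)

# Equal rates here, so A and B are not associated, i.e. the events are independent
print(rate_a, rate_a_given_b)            # 0.5 0.5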

Definition 4.2.7 The notion of independence can be extended to events that are conditionally indepen-
dent. We say that two events A and B are conditionally independent given an event C with P (C) > 0
if
P (A ∩ B | C) = P (A | C) × P (B | C).

Discussion 4.2.8 Let us continue where we left off from Discussion 4.2.5 and discuss the second error
in Sir Meadow’s calculation. Recall that the calculation was done in the following manner

P(Evidence as observed | Clark is innocent)
= P(Two infant deaths in the same family | Clark is innocent)
= P(First infant death | Clark is innocent) × P(Second infant death | Clark is innocent)
= (1/8543) × (1/8543)
= 1/72982849.

We have already noted that the first conditional probability was misinterpreted as P(Clark is innocent | Evidence as observed), resulting in the prosecutor's fallacy. Now let us focus on the second equality

P (Two infant deaths in the same family | Clark is innocent)


= P (First infant death | Clark is innocent) × P (Second infant death | Clark is innocent)

If we represent the events “First infant death”, “Second infant death”, and “Clark is innocent” as A,
B and C respectively, the equation above is now

P (A ∩ B | C) = P (A | C) × P (B | C).

This is precisely what we saw in the definition of conditional independence. So by computing the probability this way, Sir Meadow had assumed that the event where the first child died of SIDS and the event where the second child died of SIDS are independent of each other, given that Clark is innocent. Is this a reasonable or even intended assumption?
While we cannot say for certain, there could very well be factors (like genetic or environmental) that
predispose families to SIDS. What this means is that a second case of SIDS within the family becomes
much more likely to happen than it would be in another apparently similar family.
To conclude the story of Sally Clark's trial, in January 2003, the guilty verdict was overturned on a second appeal and Sally was eventually released from prison, having served more than three years of her sentence.

Section 4.3 Conjunction Fallacy, Base Rate Fallacy and Random Variables
We start this section with an example that introduces the law of total probability.

Example 4.3.1 Suppose there are two bags, each containing 10 colored balls. Bag A contains 7 red
balls and 3 green balls while Bag B contains 2 red balls and 8 green balls. One bag is randomly selected
and a ball is then randomly selected from the chosen bag. What is the probability that the ball selected is green?
Let us consider the events E, F and G such that

ˆ E is the event that Bag A is selected.

ˆ F is the event that Bag B is selected.

ˆ G is the event that the selected ball is green.

Note that E and F are mutually exclusive and either E or F must occur. The probability required
P (G) is the probability of (G and E) plus the probability of (G and F ). That is,

P (G) = P (G ∩ E) + P (G ∩ F ).

By the definition of conditional probability, this is equivalent to

P (G) = P (G | E) × P (E) + P (G | F ) × P (F ).

In our example, this means that the probability that a green ball is selected is the sum of the probabilities

P (green ball selected | Bag A is selected) × P (Bag A is selected) and

P (green ball selected | Bag B is selected) × P (Bag B is selected).



Formally, the law of total probability states that if E, F and G are events from the
same sample space S such that

(1) E and F are mutually exclusive; and

(2) E ∪ F = S.

Then,
P (G) = P (G | E) × P (E) + P (G | F ) × P (F ).
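As a quick check of Example 4.3.1, the law of total probability can be evaluated with a few lines of Python; this is only a sketch of the arithmetic, with the probabilities read off from the example.

# Law of total probability applied to Example 4.3.1
p_E = 0.5             # P(E): Bag A is selected
p_F = 0.5             # P(F): Bag B is selected
p_G_given_E = 3 / 10  # P(G | E): 3 of the 10 balls in Bag A are green
p_G_given_F = 8 / 10  # P(G | F): 8 of the 10 balls in Bag B are green

p_G = p_G_given_E * p_E + p_G_given_F * p_F
print(p_G)            # 0.55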

Recall that through the recounting of the trial of Sally Clark, we explained what is commonly known as the prosecutor's fallacy, where one confuses P(A | B) with P(B | A). In the remainder of this section, we introduce two more fallacies which are common pitfalls and misunderstandings related to probabilities.

Example 4.3.2 Suppose Joseph was seen to be acting nervously and loitering outside a convenience
store. Furthermore, he had a knife in his pocket. Based on this observation, which would you say is
more probable about Joseph?

(1) Joseph is a thief.

(2) Joseph is a thief and is jobless.

If you think that (2) is more probable than (1), you would have committed the conjunction fallacy. Formally, for two events A and B, one commits the conjunction fallacy if one believes that

P (A ∩ B) > P (A) or P (A ∩ B) > P (B).

In words, the fallacy occurs when one believes that the chance of two things happening together is higher than the chance of one of those things happening alone.
In fact, what is actually true is the following

P (A ∩ B) ≤ P (A) and P (A ∩ B) ≤ P (B).

So going back to the two statements about Joseph, it cannot be the case that Statement (2) is more
probable than Statement (1).
Let us discuss another fallacy that is related to the discussion of rates from Chapter 2.

Discussion 4.3.3 Suppose there is a rise in the number of drink-driving cases in the city. The police force, with the help of some researchers, has developed a new breathalyser to better detect drivers who drive after consuming an excessive amount of alcohol (for ease of discussion, these drivers are said to be drunk).
Through product testing, it is known that:

(1) When a driver is sober, there is a 5% chance that the breathalyser will falsely detect that the driver
is drunk.

(2) On the other hand, when the driver is in fact drunk, the breathalyser will always detect that the
driver is drunk.

Here comes the important information. Suppose it is known that 1 in 1000 drivers will perform the
dangerous act of driving when drunk.
Situation: Suppose a driver is picked up randomly at a spot check and takes the breathalyser test.
If the breathalyser indicates that he is drunk, what is the probability that he is indeed drunk? In terms
of conditional probability, we are interested in computing P (Drunk | positive test). In reality, this is an
important consideration because a sober person that returns a positive breathalyser test result may end
up being falsely accused of breaking the law!

Now, if we simply consider facts (1) and (2) obtained from the product testing, we may be inclined
to think that P (Drunk | positive test) is quite high, since the breathalyser test never fails in detecting a
drunk driver. Unfortunately, if you are led to make such a conclusion based on (1) and (2) alone, you
would have committed a base rate fallacy.

Definition 4.3.4 The base rate fallacy is a decision making error in which information about the rate
of occurrence of some trait in a population, called the base rate information, is ignored or not given
appropriate weight.
Let us return to our breathalyser example and work out what is in fact the conditional probability
P (Drunk | positive test).

Example 4.3.5 The information that 1 in 1000 drivers will drive when they are drunk is precisely the base rate information that cannot be ignored. Otherwise, we would have committed
the base rate fallacy. Similar to the method used in Chapter 2, we will construct a 2 × 2 contingency table to help us with our calculations. The table below shows how it would look after the cells are populated with numbers. The sequence ((1), (2), (3), etc.) used to populate the cells is given below the table and indicated in each cell.

Positive test Negative test Total


Drunk driver (4) 100 (4) 0 (2) 100
Sober driver (5) 4995 (6) 94905 (3) 99900
Total (7) 5095 (7) 94905 (1) 100000

(1) Suppose there are 100000 drivers in total.

(2) Since 1 in 1000 drivers drives after drinking, we know that the total number of drunk drivers is
100.

(3) This means that 99900 are sober drivers.

(4) Since the breathalyser never fails to detect a drunk driver, all 100 of them will be tested positive
and none will be tested negative.

(5) Since the breathalyser falsely detects 5% of sober drivers as drunk, 0.05 × 99900 = 4995 sober
drivers will be tested positive.

(6) This implies that 94905 of the sober drivers will be tested negative.

(7) We can now write down the total number of drivers tested positive and the total number of drivers
tested negative.
The conditional probability P(Drunk | positive test) can now be calculated easily as 100/5095 = 0.019627, which is approximately 2%. This is a very low probability and contrary to the earlier belief that a driver tested positive is very likely to be drunk. This illustrates the danger of committing the base rate fallacy, when the base rate of the occurrence of a trait in the population is not taken into consideration.
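The same figure can be obtained directly from the law of total probability (equivalently, Bayes' rule) without drawing the table. The following Python sketch simply repeats the arithmetic above.

# Breathalyser example: P(Drunk | positive test)
base_rate = 1 / 1000   # P(Drunk): 1 in 1000 drivers drives when drunk
sensitivity = 1.0      # P(positive | Drunk): the breathalyser never misses a drunk driver
false_positive = 0.05  # P(positive | Sober): 5% of sober drivers are falsely flagged

p_positive = sensitivity * base_rate + false_positive * (1 - base_rate)
p_drunk_given_positive = sensitivity * base_rate / p_positive
print(round(p_drunk_given_positive, 6))  # about 0.019627, i.e. roughly 2%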
Designing kits or apparatus to do such tests usually involves balancing the risk of having someone
testing positive when he should not be, or testing negative when he is in fact supposed to be positive.
Such considerations are common, especially in medical diagnostics as the next example further illustrates.

Example 4.3.6 Most of us should be familiar with the term ART, or Antigen Rapid Test, by now.
This test is used to test for the presence of COVID-19 infection in humans. For most medical diagnostic
tests, there are four possible scenarios that can happen when the test is administered to an individual
to assess if the individual is infected. The possible scenarios are:

1. Scenario 1: Individual is known to be infected and test shows positive.

2. Scenario 2: Individual is known to be infected and test shows negative.



3. Scenario 3: Individual is known to be not infected and test shows positive.

4. Scenario 4: Individual is known to be not infected and test shows negative.

Scenario 1 is concerned with the conditional probability of an individual being tested positive, given
that the individual is infected. This is known as the true positive rate. This probability

P (Test positive | Individual is infected)

is known as the sensitivity of the test. For this example, let us assume that this probability is 0.8.
On the other hand, scenario 4 is concerned with the conditional probability of an individual being
tested negative, given that the individual is not infected. This is known as the true negative rate. This
probability
P (Test negative | Individual is not infected)
is known as the specificity of the test. For this example, let us assume that this probability is 0.99.
In reality, these two conditional probabilities are not helpful to average users like ourselves because we do not really know whether we are indeed infected or not. What we do know with certainty is whether the individual's test returns positive or negative. Therefore, instead of the conditional probability

P (Test positive | Individual is infected)

which is difficult to ascertain if the “condition” is fulfilled, we look at the conditional probability

P (Individual is infected | Test positive).

It is important to gain insight into this conditional probability as it can cause an individual much distress
after being tested positive only to find out later that the person involved is actually not infected. To
determine this conditional probability, having only the sensitivity and specificity of the test is insufficient.
As discussed in the preceding example, we require one additional piece of information, which is the base rate of infection in the population. This is the infection rate in the population and we can interpret this as the probability that a person selected at random from the population is infected with COVID-19. For this example, let us assume that 1% of the population is infected with COVID-19, so

P (Individual is infected) = 0.01.

We will again use a contingency table to study the rates. To start, we choose a large enough number to
represent the total population such that our calculations would result in whole numbers. Let us assume
that the population consists of 100000 individuals.

Tested positive Tested negative Row total


Infected with COVID-19
Not infected with COVID-19
Column Total 100000

Using the information we have for the base rate of infection, we can now fill in the row total for those
that are infected (= 1% × 100000 = 1000) and those that are not infected (= 99% × 100000 = 99000).

Tested positive Tested negative Row total


Infected with COVID-19 1000
Not infected with COVID-19 99000
Column Total 100000

Next, using the true positive rate (sensitivity) of 0.8, we see that 80% of those infected would be
tested positive, that is,

Number of tested positive and Infected = 0.8 × 1000 = 800, and



Number of tested negative and Infected = 0.2 × 1000 = 200.


Similarly, using the true negative rate (specificity) of 0.99, we see that 99% of those not infected would
be tested negative, that is,

Number of tested negative and Not infected = 0.99 × 99000 = 98010 and

Number of tested positive and Not infected = 0.01 × 99000 = 990.

Tested positive Tested negative Row total


Infected with COVID-19 800 200 1000
Not infected with COVID-19 990 98010 99000
Column Total 100000

The table can now be completed by summing up the column totals for those tested positive and those
tested negative.

Tested positive Tested negative Row total


Infected with COVID-19 800 200 1000
Not infected with COVID-19 990 98010 99000
Column Total 1790 98210 100000

By now, you should appreciate the choice of 100000 as the total population, as we did not have to deal
with the awkward situation of not having whole numbers when we are dealing with human individuals.
We are now able to calculate the rate of COVID-19 infection among those tested positive. Since there
are 1790 individuals tested positive, and 800 of them are infected, the rate is
rate(Infected | Tested positive) = 800/1790 = 0.447 (rounded to 3 significant figures).
Using the correspondence between conditional rates and conditional probabilities, we are now able to
say that if an individual is tested positive for COVID-19 infection using an ART, the probability of
him actually being infected is about 0.45. This conditional probability is rather low so typically, more
rigorous tests need to be conducted to ascertain if the individual is indeed infected with COVID-19.
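The counts in the completed table can also be reconstructed programmatically. The short Python sketch below starts from the base rate, sensitivity and specificity assumed in this example and recovers the same conditional rate.

# ART example: rebuild the contingency table from the assumed rates
population = 100000
base_rate = 0.01       # P(infected)
sensitivity = 0.8      # P(test positive | infected)
specificity = 0.99     # P(test negative | not infected)

infected = base_rate * population                   # 1000
not_infected = population - infected                # 99000
true_positive = sensitivity * infected              # 800
false_positive = (1 - specificity) * not_infected   # 990

rate_infected_given_positive = true_positive / (true_positive + false_positive)
print(round(rate_infected_given_positive, 3))       # 0.447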
To conclude this section, we will give a very brief introduction to the concept of a random variable.

Definition 4.3.7 A random variable is a numerical variable with probabilities assigned to each of the
possible numerical values taken by the numerical variable.

Example 4.3.8 Consider the probability experiment of rolling a six-sided die. The faces of the die are
denoted 1, 2, 3, . . . , 6. In this experiment, the possible outcomes are 1, 2, 3, . . . , 6. If we let Y be the
numerical variable that represents the outcome of this experiment with the assigned probabilities
P(Y = 1) = 1/3, P(Y = 2) = 1/3, P(Y = 3) = 1/12,
P(Y = 4) = 1/12, P(Y = 5) = 1/12, P(Y = 6) = 1/12,
then Y is a random variable.

Definition 4.3.9 If the numerical variable is a discrete variable, we call the random variable a discrete
random variable. On the other hand, if the numerical variable is a continuous variable, then the random
variable is a continuous random variable.
The random variable Y in Example 4.3.8 is a discrete random variable. It is common to use a table
similar to the one below to illustrate a discrete random variable and the associated probabilities for each
outcome.

Outcome       1     2     3      4      5      6
Probability   1/3   1/3   1/12   1/12   1/12   1/12
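As a quick illustrative check (a Python sketch, not part of the module's software), we can verify that the assigned probabilities sum to 1 and simulate many realisations of Y to see the relative frequencies settle near the assigned probabilities.

import random

# Probabilities assigned to the outcomes of Y in Example 4.3.8
probs = {1: 1/3, 2: 1/3, 3: 1/12, 4: 1/12, 5: 1/12, 6: 1/12}
print(sum(probs.values()))   # 1.0 (up to floating-point rounding)

# Simulate 100000 realisations of Y and compare relative frequencies with probs
random.seed(1)
rolls = random.choices(list(probs.keys()), weights=list(probs.values()), k=100000)
for y in sorted(probs):
    print(y, round(rolls.count(y) / len(rolls), 3), round(probs[y], 3))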

Unlike discrete random variables, which may take on only a countable number of distinct values, a continuous random variable takes values in an interval containing infinitely many possible values. Height, weight and time required to run one kilometer are some examples of continuous random variables. Unlike discrete random variables, probabilities for a continuous random variable are not assigned to individual values but to intervals of values.

Example 4.3.10 While a discrete random variable X takes on a countable number of distinct values, a
continuous random variable Y can be visualised by a “continuous series of points” which forms a density
curve on the standard x and y-axes.

In particular, probabilities for a continuous random variable are defined over intervals of values and are represented by areas under the density curve.

For the density curve shown above, the probability that Y assumes a value between 0.3 and 0.5 is
the area under the density curve of Y in the interval [0.3, 0.5] as indicated by the shaded region. In this
example, this area turns out to be 0.311 and we write

P (0.3 ≤ Y ≤ 0.5) = 0.311.

In general, the probability that a continuous random variable takes on a value in an interval [a, b] is equal
to the area under its density curve from a to b.
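The density curve in the figure is not specified in these notes, so as a stand-in the Python sketch below takes Y to be a standard normal random variable and computes P(0.3 ≤ Y ≤ 0.5) as a difference of cumulative probabilities, i.e. as the area under that density curve between 0.3 and 0.5. The numerical answer naturally differs from the 0.311 obtained for the density in the figure.

from scipy.stats import norm

# Stand-in density: take Y to be standard normal (the density in the figure is different)
area = norm.cdf(0.5) - norm.cdf(0.3)   # P(0.3 <= Y <= 0.5) as an area under the curve
print(round(area, 3))                  # about 0.074 for the standard normal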

Section 4.4 Statistical Inference and Confidence Intervals


In Chapter 1, through the introduction of various methods of sampling, we discussed the generalisability
criteria. In particular, if a sample is collected and used in a study, is the result of the study representative

of the population where the sample was taken from? In order to answer this question, we need to first
know the survey methodology used to generate the sample and then secondly the statistical methods
used to infer this finding about the target population. This second phase is statistical inference.

Definition 4.4.1 Statistical inference refers to the use of samples to draw inferences or conclusions
about the population in question.
The figure above shows how statistical inference fits into the exploratory data analysis (EDA) frame-
work. When given a sample, we learnt how to generate questions, visualise and summarise data and refine
our questions before starting another round of further data exploration and visualisation. However, these
findings are all at the sample level and the natural question would be whether similar conclusions can
be made at the population level. Examples of population level information could be about a population
parameter or whether two categorical variables are associated with each other in the population.
Recall the notion of a census, previously defined in Chapter 1. To obtain definitive conclusions about
the population, one would have to take a census, which may not be possible or desirable. Some possible
reasons why taking a sample is preferred over conducting a census are:

(i) Cost. A census requires measurement of every unit in the population which, even when possible, can be very costly. For example, in a study on the mental health of Singaporeans, a census would amount to measuring the state of mental wellness of every Singaporean. This requires resources which may be beyond the reach of the researcher conducting the study.

(ii) Feasibility. Imagine that instead of taking a sample of your blood for a blood test, the doctor tells you that in order to find out if you are free from a particular illness, he would need to take ALL of your blood. While this example is a bit far-fetched, many other instances exist where it is simply not feasible to conduct a census.

Recall that the population parameter is a numerical fact about the population, and something that is of interest to a researcher conducting a study. When we take a sample from the population, the use of a sample statistic to estimate the population parameter is subject to inaccuracies. These inaccuracies primarily come under two categories, namely bias and random error. So we typically have

Sample statistic = population parameter + bias + random error.

Ideally, we want the sample statistic to be as close to the population parameter as possible so that it is
a good estimate of the population parameter. In order to use our sample to make inference about the
population, the fundamental rule for using data for inference should be met.

Fundamental Rule for using Data for Inference


Available data can be used to make inferences about a much larger group if the data
can be considered to be representative with regards to the question of interest.

By adopting good sampling methods (e.g. using simple random sampling) and practices (e.g. having
a good sampling frame), selection bias can be reduced. In addition, having a high response rate will
minimise non-response bias. If bias can be reduced to an insignificant level, this would allow us to say

Sample statistic = population parameter + random error.

What stands between the sample statistic and the population parameter is random error. This quantity
refers to the small differences that arise as a result of the sampling variability when using any probability-
based sampling method.
In what follows, we will discuss two types of statistical inference, namely confidence intervals and
hypothesis testing.

Example 4.4.2 Consider the following screenshots that show 10 simple random samples, each of size
2500, drawn from a data set containing information on the distances covered by various airplane flights.1

Notice that the sample means (average distances covered by the 2500 flights) of the 10 samples are all
different. What can we infer the population mean to be? We require the concept of confidence intervals.

Definition 4.4.3 A confidence interval is a range of values that is likely to contain a population pa-
rameter based on a certain degree of confidence. This degree of confidence is called the confidence level
and is usually expressed as a percentage (%).

Some examples of population parameters that we would like to construct confidence intervals for are
proportion, mean and also standard deviation. For this course, we will focus on the construction of
confidence intervals for population proportion and mean. We will use two examples to explain the idea
behind the construction of the confidence intervals.

Example 4.4.4 The figure below shows part of the data set “2020 Resale Price Data”. This data set
provides information on the resale transactions of HDB resale flats in the year 2020. There are a total
of 23334 transactions (the population) in 2020 and there are 14 variables in this data set.

1 You are encouraged to explore the R shiny app at https://david-chew.shinyapps.io/WhySRS/



To illustrate the construction of the confidence interval for population proportion, we will consider
the variable “flat type”. This variable indicates whether the resale flat is of the type 1-room, 2-rooms,
3-rooms, 4-rooms, 5-rooms, executive or multi-generational. It is clear that “flat type” is a categorical
variable with 7 categories. Suppose we ask the following question on the population parameter:

Among the HDB resale transactions in 2020, what proportion (denoted by p) of them
is for 5-room flats?

Now, let’s say that a simple random sample of 2000 resale transactions are taken and the breakdown
of the 2000 transactions according to flat type is shown in the table below.

Flat type
1-rm 2-rm 3-rm 4-rm 5-rm Executive Multi-Gen.
Frequency 2 41 464 819 508 165 1
Proportion 0.001 0.0205 0.232 0.4095 0.254 0.0825 0.0005

Notice that for this sample, the proportion of resale transactions that are 5-room flats is 508/2000 = 0.254.
The population proportion p is unknown to us and can only be found if we take a census of all the 23334
transactions. What we are interested to know is how good an estimate is our sample proportion of 0.254.
If we assume that there is no bias in our sample, then

0.254 = population proportion + random error.

It should be noted at this point that random error can be positive or negative. If the random error is
positive, then the sample proportion of 0.254 is larger than the population proportion. On the other hand,
if the random error is negative, then the sample proportion is smaller than the population proportion.
To construct a confidence interval for the population proportion, we use the following formula
p∗ ± z∗ × √(p∗(1 − p∗)/n),

where

p∗ = sample proportion
z∗ = “z-value” from the standard normal distribution
n = sample size

The exact value of z ∗ depends on the confidence level of the confidence interval we are constructing. For
a 90% confidence interval, the value of z ∗ is 1.645 while for a 95% confidence interval, the value of z ∗ is
1.96. Thus, for this example, the 95% confidence interval for the population proportion is
0.254 ± 1.96 × √(0.254(1 − 0.254)/2000) = 0.254 ± 0.0191.

So the interval is 0.254 ± 0.0191.
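As an aside, the arithmetic above is straightforward to reproduce; the Python sketch below is just one way of doing it and is not the module's prescribed software.

from math import sqrt

p_hat = 0.254   # sample proportion of 5-room transactions
n = 2000        # sample size
z_star = 1.96   # z-value for a 95% confidence level

margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
print(round(margin, 4))                                     # about 0.0191
print(round(p_hat - margin, 4), round(p_hat + margin, 4))   # about 0.2349 and 0.2731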

Remark 4.4.5

ˆ While the computation of the confidence interval is simple, for this course, we will use software to
help us perform these computations.

ˆ In particular, the value of z ∗ that is dependent on the chosen confidence level can be found from
statistical tables similar to the one shown below. However, when we use software for the compu-
tation of confidence intervals, these values will be appropriately chosen by the software when we
specify the confidence level we wish to use.

Discussion 4.4.6 Now that we have seen how the computation of a confidence interval for population
proportion is done, what is also important is the interpretation of the interval. What does it mean to
say that the 95% confidence interval for the population proportion of 5-room resale flat transactions in
2020 is 0.254 ± 0.0191?
A confidence interval is reported in 2 parts, namely:

ˆ The confidence level (for example, 95% in the example above); and

ˆ The interval (0.254 ± 0.0191 for the example above).

The value 0.0191 is known as the margin of error which directly impacts the width (how wide/narrow)
of the confidence interval.

In reporting the confidence interval computed above, we say

We are 95% confident that the population proportion (the parameter in this case) of
resale flat transactions in 2020 that are 5-room, lies within the confidence interval.

It is natural to ponder what we mean by “95% confident”. In the context of confidence intervals, this
has a specific meaning which can be explained by repeated sampling. Recall that the sample statistic
of 0.254 was computed from a single sample (collected via Simple Random Sampling) of 2000 resale
transactions. It was from this sample statistic that the confidence interval was constructed.
The idea of repeated sampling is based on the supposition that many simple random samples of the same size
are taken and with the different sample statistics obtained from the different samples, different confidence
intervals are constructed using the same method as above.

Using the idea of repeated sampling, the interpretation of “95% confident” is that if many simple
random samples of the same size are taken, and a confidence interval is constructed for each of them,
then about 95% of the confidence intervals constructed would contain the population parameter. Thus,
if we collected 100 simple random samples and their 95% confidence intervals were computed in the same
manner, then about 95 out of the 100 confidence intervals will contain the population parameter. So
in the figure above, assuming that the purple line is the actual population proportion of 5-room resale
flats, the confidence intervals constructed for samples 1 and 2 would actually contain the population
parameter while the confidence interval derived from sample 100 would not.
It is important to remember that in actual fact, we do not know what is the exact value of the
population parameter. Confidence intervals certainly give us a better idea of where this parameter lies
but they can never tell us its exact value.
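The repeated-sampling interpretation can be illustrated with a small simulation. In the Python sketch below, the "population proportion" of 0.25 is an assumed value used only for the illustration: many samples are drawn, a 95% confidence interval is constructed from each, and the fraction of intervals that contain the true proportion should come out close to 95%.

import random
from math import sqrt

random.seed(2024)
p_true, n, z_star = 0.25, 2000, 1.96   # assumed population proportion (illustration only)

num_samples = 1000
covered = 0
for _ in range(num_samples):
    # draw one sample of size n and compute its sample proportion
    successes = sum(random.random() < p_true for _ in range(n))
    p_hat = successes / n
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - margin <= p_true <= p_hat + margin:
        covered += 1

print(covered / num_samples)   # typically close to 0.95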

Remark 4.4.7 Going back to Example 4.4.4, based on the sample of 2000 resale transactions, it is a common mistake to say that there is a 95% chance that the population proportion of 5-room resale flats lies between 0.235 and 0.273. It is actually incorrect to make a statement like this because

ˆ The population proportion p is “fixed”, although unknown to us. There is no probabilistic element
in what this proportion is going to be.

ˆ For any particular sample, the confidence interval constructed only depends on the sample propor-
tion and the value of z ∗ corresponding to a chosen confidence level. Thus, the confidence interval
is also “fixed” and there is also no probabilistic element in it.

Thus either the population parameter IS in the interval or it IS NOT. It is wrong to say there is a 95%
chance that it is in the interval (and 5% chance that it is not)! The element of chance (or probability)
comes from the uncertainty of sampling rather than the uncertainty in the value of the population
parameter. Therefore, we should always remember the interpretation as the percentage of samples of
the same size collected repeatedly, using the same method of simple random sampling, that give rise to
confidence intervals containing the unknown population parameter.

Remark 4.4.8 (Properties of confidence intervals.)

1. Recall that in Example 4.4.4 we computed the confidence interval using the sample estimate of
0.254, confidence level of 95% and sample size n = 2000:
0.254 ± 1.96 × √(0.254(1 − 0.254)/2000) = 0.254 ± 0.0191.
What happens when another sample is taken, using the same sampling frame, same sampling
method (simple random sampling) but a smaller sample size of 1000? If this new sample also
resulted in the sample estimate of 0.254, the 95% confidence interval would be
0.254 ± 1.96 × √(0.254(1 − 0.254)/1000) = 0.254 ± 0.0270.
This confidence interval is wider than the previous one because the sample size is smaller.
Similarly, if yet another sample is taken, under exactly the same conditions with the only difference
being that the sample size is 5000, then if the sample estimate is again 0.254, the confidence interval
will now be
0.254 ± 1.96 × √(0.254(1 − 0.254)/5000) = 0.254 ± 0.0121.
What we are seeing here is that the larger the sample size, the smaller the random error, which
will then result in a narrower confidence interval. This is not surprising as we have seen in Chapter
1 that increasing sample size can result in reducing random error.

2. Other than the size of the sample, the other factor that affects the width of the confidence interval
is the confidence level. Recall that when we set the confidence level to be 95%, the confidence
interval obtained, based on n = 2000 and the sample proportion of 0.254 was 0.254 ± 0.0191. What
happens if we set the confidence level to be 90%? In this case, the value of z ∗ is 1.645 and the 90%
confidence interval for the population proportion is
0.254 ± 1.645 × √(0.254(1 − 0.254)/2000) = 0.254 ± 0.0160.
So the interval is 0.254 ± 0.0160. This interval is narrower than the interval obtained when the confidence level was 95%. Using the idea of repeated sampling, this makes sense, since a narrower interval would imply that a smaller percentage (90% rather than 95%) of the confidence intervals constructed from repeated samples would contain the population parameter. Generally speaking, the higher the confidence level at which the confidence interval is constructed, the wider the confidence interval.

Example 4.4.9 Let us consider another example on the construction of a confidence interval, where
we would like to estimate the population mean based on a sample mean. Using the same data set
previously, containing all the resale transactions of HDB flats in 2020, we would like to investigate the
mean resale price of all the transactions by constructing confidence intervals.

We will describe how a confidence interval for population mean resale price is constructed. The
properties and interpretations of the confidence interval for a population mean are similar to those for a
population proportion that we have discussed in Example 4.4.4. Again, we will not be computing these
confidence intervals by hand but instead use software to help us perform these computations.
Suppose we have a sample, obtained via simple random sampling, with sample size 2000. The sample
mean resale price is found to be x = $448727. Let µ be the population mean resale price, a population
parameter of interest that is unknown to us unless we take a census of the population. A 95% confidence
interval for the population mean µ is constructed using the formula
x ± t∗ × s/√n,

where

x = sample mean
t∗ = “t-value” from the t-distribution
s = sample standard deviation
n = sample size

The exact value of t∗ depends on the sample size n and the confidence level of the confidence interval
we are constructing. Without going into the computation details, we will simply state that the 95%
confidence interval for the population mean is found to be 448727 ± 6706.01.

The margin of error, $6706.01 is a way of quantifying the random error and as discussed previously,
this error can be reduced by increasing the sample size n (everything else being equal). The width of
the confidence interval can also be narrowed if we reduce the confidence level to one that is lower than
95%.
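For completeness, the sketch below shows how such a t-based interval could be computed in Python. The sample standard deviation of the resale prices is not stated in these notes, so the value of s used here is a made-up placeholder chosen only so that the resulting margin of error is of roughly the same size as the one quoted above.

from math import sqrt
from scipy.stats import t

n = 2000
x_bar = 448727   # sample mean resale price from the example
s = 152900       # hypothetical sample standard deviation (not given in the notes)

t_star = t.ppf(0.975, df=n - 1)    # t-value for a 95% confidence level
margin = t_star * s / sqrt(n)
print(round(margin, 2))            # roughly 6700 for this choice of s
print(round(x_bar - margin), round(x_bar + margin))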
To summarise this section on confidence intervals, recall the following:

1. The use of confidence intervals is a way for us to quantify random error that is present in every
sample, even in those obtained via simple random sampling where the level of bias can be reduced
or assumed negligible.

2. Confidence intervals and the confidence level used to compute the intervals can be understood via
the idea of repeated sampling. We should avoid using the word “chance” or “probability” when
we discuss whether the population parameter lies inside the confidence interval constructed from
a single sample.

3. We have discussed some properties of confidence intervals, in particular how the interval varies
according to the sample size and the confidence level applied.

4. We saw how confidence intervals are constructed for two population parameters, namely the popu-
lation proportion and the population mean. It is useful to understand how the construction is done
although for the purpose of this module, we will rely on software to assist in the computation.

Section 4.5 Hypothesis Testing

Discussion 4.5.1 The second tool for statistical inference is hypothesis testing. Recall that when a sample is taken from a population, we can try to use a sample statistic to make inferences about a population parameter. If biases can be reduced to something negligible, what separates the sample statistic from the population parameter is the quantity of random error, as seen previously in

Sample statistic = population parameter + bias + random error.

In this section, we will assume that our sample is taken from the population using simple random
sampling, from a perfect sampling frame and with 100% response rate.

Definition 4.5.2 A hypothesis test is a statistical inference method used to decide if the data from a
random sample is sufficient to support a particular hypothesis about a population.

Discussion 4.5.3 A typical hypothesis about the population could be anything we want to know about
the population. For this course we will focus on two types of hypothesis about the population, in
particular, whether

(i) a population parameter is x;

(ii) in the population, 2 categorical variables A and B are associated with each other.

For example, a survey obtained from a simple random sample of Singaporeans (assuming a good
sampling frame and 100% response rate) found that 2 in 5 Singaporeans struggle with mental health
issues. A hypothesis can be made that in the population of Singaporeans, 3 in 5 (or a proportion of 0.6)
struggle with mental health issues. Using our sample, where the sample proportion is 0.4, we can test whether there is sufficient evidence to reject the null hypothesis that the population proportion is 0.6.
Confidence intervals discussed in the previous section and hypothesis testing are two related methods.
Using the same example of population proportion above

(a) we can use our sample proportion of 0.4 to construct a confidence interval and say, with some
degree of confidence that our interval contains the hypothesised population proportion.

(b) we can also answer the question of whether it is likely to observe a sample proportion of 0.4 if the
population proportion is hypothesised to be 0.6.

Hypothesis testing asks if our observed sample proportion’s deviation from the hypothesised popula-
tion proportion can be explained by chance variation. Of course, the bigger the difference between the
sample proportion and the hypothesised population proportion, the less likely this difference is due to
random chance.
(Five steps of hypothesis testing)
Step 1: The first step of hypothesis testing is to identify the question and state the null hypothesis and
alternative hypothesis. How these hypotheses are stated depends on the context of the question and our
aim.
Step 2: Next, we have to set the significance level of our test. The significance level of a hypothesis
test is a measurement of our threshold or tolerance for determining if the deviation of what is observed
for the sample, from what is hypothesised for the population, can be explained by chance variation. The
significance level is often set at 5%, although others like 1% or 10% are also used frequently.
Step 3: Using our sample, we find the relevant sample statistic.
Step 4: With the sample statistic and the hypothesis, we can calculate the p-value (see Definition 4.5.4).
Step 5: We then make a conclusion of the hypothesis test. What the conclusion turns out to be depends
on the p-value calculated and the significance level set for the test.

Definition 4.5.4 The p-value is the probability of obtaining a result as extreme or more extreme than
our observation in the direction of the alternative hypothesis, assuming the null hypothesis is true.
Using examples, we will now describe three types of hypothesis tests commonly used. These examples
are based on the data set StudentsPerformance.csv (the population, with 1000 observations) and a sample
SP Sample A.csv, of size 200, taken from it. You may access these csv files by referring to the Technical
videos provided for this Chapter. The background of the data set is from a high school with 1000
students. Some information of each student is provided in the form of categorical variables like gender
and ethnicity. The scores obtained by each student for three tests are also provided.

Example 4.5.5 (Hypothesis test for population proportion) The figure below shows a snapshot
of the sample. For this particular example, we are looking at the categorical variable “test prep”. The
principal of the school believes that his students are generally hardworking and half of them would have
completed the test preparation course before taking the three tests. On the other hand, the teachers of
the school, who claim they know the students better, feel that the students are lazy and less than half of
them would have completed the course before sitting for the three tests. We will conduct a hypothesis
test for population proportion at 5% significance level to check if there is sufficient evidence to reject the
principal’s hypothesis that half of the student population are hardworking.

To proceed, we need to state our null and alternative hypotheses. Recall that the null hypothesis corresponds to the case where our observation can be explained by chance variation, that is, in this example, that the principal's belief is correct. On the other hand, the alternative hypothesis corresponds to the case where our observation is not due to random chance.
Null hypothesis:

ˆ Population proportion = 0.5.

ˆ We can write this as H0 : p = 0.5.



Alternative hypothesis:

ˆ Population proportion < 0.5.

ˆ We can write this as H1 : p < 0.5.

Note: For this example and Example 4.5.7, when the hypothesis test is conducted for the population
proportion or population mean, the null hypothesis typically takes the form

H0: population parameter = null value,

while the alternative hypothesis will be either

H1: population parameter < null value

or

H1: population parameter > null value.

It should be noted that the null and alternative hypotheses should be mutually exclusive, meaning that
they cannot be true simultaneously.
Let us return to our example where we would like to test the principal’s hypothesis that the population
proportion p = 0.5. The next step requires us to choose the significance level. For this example, we will
select the significance level to be 5%. Step three is to determine the sample statistic, which in this case is simply the sample proportion derived from the sample SP Sample A.csv. A quick check of the .csv file using appropriate software will reveal that the sample proportion, denoted by p∗, equals 0.335.
There are two trains of thought right now.

(T1) The principal's hypothesis is correct. Despite the low sample proportion p∗ = 0.335 being observed, this is merely chance variation: when the sample was selected, fewer students who had completed the test preparation course happened to be drawn into the sample.

(T2) The population proportion p is really smaller than 0.5 and thus it is natural to see a sample proportion p∗ that is less than 0.5 (here p∗ = 0.335).

Without going into the details of how the p-value is computed, we use software to calculate the p-value in this case, which turns out to be smaller than 0.001. The interpretation of this is that the probability of obtaining a sample proportion of p∗ = 0.335 or lower (assuming that the null hypothesis is true, that is, p = 0.5) is smaller than 0.001, which is very small indeed.
To make our conclusion, we compare the p-value (which is smaller than 0.001) with the significance
level that we have set, which is 5% (or 0.05). Since the p-value is smaller than 0.05, our conclusion is
to reject the null hypothesis in favour of the alternative hypothesis. That is, we reject the explanation
given in (T1) in favour of that explained in (T2).
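One common way to obtain such a p-value is a one-sided z-test for a proportion; the Python sketch below reproduces the order of magnitude of the result, although the module's software may use a slightly different procedure.

from math import sqrt
from scipy.stats import norm

p0, p_hat, n = 0.5, 0.335, 200   # null value, observed sample proportion, sample size

# One-sided z-test of H0: p = 0.5 against H1: p < 0.5
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = norm.cdf(z)            # probability of a sample proportion this low or lower
print(round(z, 2), p_value)      # z is about -4.67, p-value well below 0.001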

Remark 4.5.6 In Example 4.5.5, since the p-value computed was smaller than the significance level, we rejected the null hypothesis in favour of the alternative hypothesis. The table below shows the two possibilities when the computed p-value is compared to the significance level.

p-value < significance level: Sufficient evidence to reject the null hypothesis in favour of the alternative hypothesis.

p-value ≥ significance level: Insufficient evidence to reject the null hypothesis. The hypothesis test is inconclusive. This does not mean that we accept the null hypothesis.

Example 4.5.7 (Hypothesis test for population mean) We next discuss the second type of hy-
pothesis test, where we hypothesise the population mean instead of proportion. We will use the t-test
to perform the hypothesis test for population mean.

Returning to the high school example, the principal believes that his students are poor readers and that the average reading score of all the 1000 students in the school is 69. The teacher who is in charge of teaching reading skills to all the 1000 students thinks differently and believes that the average reading score of all the students in the school is greater than 69. Again, we will conduct the test at the 5% significance level.
The null and alternative hypotheses in this case are as follows:
Null hypothesis:

ˆ Population mean reading score = 69.

ˆ We can write this as H0 : µ = 69.

Alternative hypothesis:

ˆ Population mean reading score > 69.

ˆ We can write this as H1 : µ > 69.

The sample statistic in this case is simply the sample mean reading score derived from the sample
SP Sample A.csv. A quick check of the .csv file using appropriate software will reveal that the sample mean reading score x is equal to 70.345. What are the two trains of thought now?

(T1) The principal's hypothesis is correct. Despite the high sample mean x = 70.345 being observed, this is merely chance variation: when the sample was selected, more students who scored higher marks for reading happened to be drawn into the sample.

(T2) The population mean µ is really larger than 69 and thus it is natural to see a sample mean x (= 70.345) that is larger than 69.

Again, without going into how the p-value is computed, we use software to obtain the p-value in this case, which turns out to be 0.093. Note that in this case, the p-value is greater than the significance level of 0.05. Thus, the conclusion of our test is that there is insufficient evidence to reject the null hypothesis that the population mean reading score is 69.
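If the sample file is available, a one-sided t-test can be run in a few lines of Python; the file name comes from the notes, but the exact column name for the reading score is an assumption here.

import pandas as pd
from scipy.stats import ttest_1samp

# Load the sample (the column name "reading score" is an assumption)
sample = pd.read_csv("SP Sample A.csv")
reading = sample["reading score"]

# One-sided t-test of H0: mu = 69 against H1: mu > 69
result = ttest_1samp(reading, popmean=69, alternative="greater")
print(result.statistic, result.pvalue)   # the notes report a p-value of about 0.093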

Example 4.5.8 (Hypothesis test for association) In this third example, we discuss a hypothesis
test for association between categorical variables in the population. This is done using a chi-squared
test.

The two categorical variables we are interested to investigate are gender and test preparation. In other
words, we would like to test whether gender (male/female) is associated with the test preparation course
(completed/not completed) at the population level. We will also conduct the test at 5% significance
level. The null and alternative hypotheses are stated as follows:

Null hypothesis:

ˆ There is no association between gender and test preparation course at the population level.

Alternative hypothesis:

ˆ There is an association between gender and test preparation course at the population level.

A simple check on the sample of 200 students reveals the number of male/female students who completed or did not complete (none) the preparation course, as shown in the table below.

test prep course


Gender Completed None Total
Female 30 66 96
Male 37 67 104
Total 67 133 200

On the other hand, if we assume that H0 is true, then we would expect the following:

(i) Based on 200 students (96 females and 104 males), where 67 completed the course and 133 did not
complete the course, the number of female students who completed the course should be
(96/200) × 67 ≈ 32.16.

(ii) This implies that 67 − 32.16 = 34.84 male students completed the course.

(iii) Similarly, we have 96 − 32.16 = 63.84 female students who did not complete the course and 104 − 34.84 = 69.16 male students who did not complete the course.

The p-value for the chi-squared test is then the probability of obtaining an observation like ours, or one even more extreme (with a bigger difference between the expected and observed figures), assuming that the null hypothesis is true. Using software, we obtain a p-value of 0.517. Since the p-value is not smaller than the significance level of 5%, we conclude that there is insufficient evidence to reject the null hypothesis and the test is inconclusive.
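For reference, the chi-squared test can be reproduced from the observed counts alone; the Python sketch below is one way to do it (the continuity correction is switched off so that the expected counts and p-value match the figures quoted above).

from scipy.stats import chi2_contingency

# Observed counts from the sample of 200 students
# rows: female, male; columns: completed, none
observed = [[30, 66],
            [37, 67]]

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(expected)            # about [[32.16, 63.84], [34.84, 69.16]]
print(round(p_value, 3))   # about 0.517, not smaller than 0.05, so the test is inconclusive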
We have only given very brief descriptions of the three types of hypothesis tests. Further explanations
on them, as well as the technicalities involved can be found in standard books on statistics, but are beyond
the scope of this course.

Exercise 4
1. A researcher developed a new test to detect COVID-19 in humans and the test has a specificity
of 0.90. He administers the test in a town of 100,000 people, of whom 1% have COVID-19, as
indicated in the contingency table below.

Positive Negative Row total


COVID-19 1000
No COVID-19 99000
Column total 100000

What can be said about the sensitivity of the test, assuming that the researcher obtained

rate(COVID-19 | Negative) = 1/298
for his test?

(A) The sensitivity is less than 80%.


(B) The sensitivity is more than 80%.
(C) The sensitivity is equal to 80%.

2. A player rolls a fair six-sided die twice. You can assume the rolls are independent. We define the
following events:
A: The first roll shows numbers 1 or 2.
B: The second roll shows numbers 5 or 6.
C: The sum of the two rolls is less than or equal to 7.
Consider the following statements:

(I) P (B | C) > P (B).


(II) P (A and C) = P (A) × P (C).

Which of the statements above must be true?

(A) Only (I).


(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

3. A game is played using a fair six-sided die, a pawn and a simple board as shown below. (A pawn
is a chess piece.)

S 1 2 3 4 5 E

Initially, the pawn is placed on square S. The game is played by throwing the die and moving the
pawn back and forth in the following manner:

S 1 2 3 4 5 E 5 4 3 2 1 2 3 .....

Thus, for example if the first and second throws of the die give a “5” and “4” respectively, the final
position of the pawn will be on square “3”, because the first throw would send the pawn to square
5, and the second throw would then send the pawn from square “5” to square “3”.
The game will stop only when the pawn stops at square “E” after a die roll, passing by “E” does
not end the game.
Let X denote the number of throws of the die required to move the pawn such that it stops at
square “E”. Which of the following statements is/are true?
(I) P(X = 2) = 5/36.
(II) The events X = 1 and X = 2 are mutually exclusive.

(A) Only (I).


(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

4. There are 5 identical bags, except that 2 are coloured red and 3 are coloured blue. Each of the
bags contains 4 identical balls, except that 3 are coloured yellow and 1 is coloured green. Let A be
the event that a randomly selected bag is red, and B be the event that a ball randomly selected
from the chosen bag is yellow. You are given that

P (A and B) = 0.3.

What can we say about the events A and B?

(I) The two events are mutually exclusive.


(II) The two events are independent.

(A) Only (I).


(B) Only (II).
(C) Neither (I) nor (II).
(D) Both (I) and (II).

5. Suppose A and B are events with probabilities P (A) = 0.4 and P (B) = 0.7. Which of the following
statements is/are correct?

(I) A and B can be mutually exclusive.


(II) P (A and B) = 0.4 + 0.7.

(A) Only (I).


(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

6. I have a fair 12-sided (dodecahedron) die with sides labelled 1, 2, . . . , 12 respectively. I also have a
fair 6-sided die with sides labelled 1, 2, . . . , 6 respectively. I first roll the 12-sided die on the table,
then roll the 6-sided die. Assume that the two die rolls are independent, what is the probability
that the sum of the numbers appearing face up on the two dice is 11?
(A) 1/12.
(B) 5/36.
(C) 1/18.

(D) 1/9.

7. We wish to deploy a certain number of sensors around a particular area so as to detect intruders
moving through the area. We may assume that the sensors function independently and each has
probability 0.9 of detecting an intruder in the area. We would like to achieve at least 99.5% success
rate of detecting an intruder using the sensors. What is the minimum number of sensors we need
to deploy in order to achieve this target?

(A) Two.
(B) Three.
(C) Four.
(D) Target cannot be achieved.

8. There is a new home test kit for HIV detection. This test kit is known to be 99% accurate,
meaning that its sensitivity and specificity are both 99%. Which of the following statements must
be correct?

(I) Among those with HIV, 1% will test negative using the kit.
(II) Among those who test negative using the kit, 1% will have HIV.

(A) Only (I).


(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

9. Mr Tan is doing a study on a fair coin. He flips it twice and records the results on a paper. If the
result is ‘Heads’, he writes down ‘H’. If the result is ‘Tails’, he writes down ‘T’. Mr Chan happened
to see part of the paper and saw a ‘H’. Mr Chan then tells Mr Tan: “I guess that both your coin
flips are Heads”. What is the probability that Mr Chan is correct?

(A) 1/1
(B) 1/2
(C) 1/3
(D) 1/4

10. John plays in the final round of a rock-paper-scissors tournament against his opponent. The final
round consists of 5 games of rock-paper-scissors. The first person to win 3 games wins the final
round. John has already won 1 game and lost 2 games in the final round.
The probability of winning each game is independent from one another. The probability of John
winning a game is 0.4, and the probability that John wins the “best player award” is 0.4.
Which of the following statements must be true? Select all that apply.

(A) The probability that John wins the next game is greater than the probability that John wins
the final round.
(B) P(John wins the next game | John wins best player award) = P(John wins best player award
| John wins the next game).
(C) John winning the next game is independent of John winning the best player award.
(D) John winning the next game and John winning the best player award are mutually exclusive.

11. In a cohort of final year undergraduate statistics students, 70% are male and 30% are female. Every
student has the option of completing either a research project or a final year internship but not
both. 40% of the males and 70% of the females decided to take the research project option. If I
randomly pick a student from this cohort, what is the exact probability that the student picked
the option of a research project?

12. After coming across an old Channel News Asia article: ”Should women do National Service now?
Societal cost will ’far outweigh’ benefits, says Ng Eng Hen.” published on 9 May 2022, Chase
and Jenny started chatting about whether most Singaporeans will think that women should not
be enlisted into National Service (as deduced from the article). Chase said that a randomly
picked person from the Singaporean population who thinks that women should not be enlisted
into National Service, is equally likely to be male or female. Jenny tried to represent what Chase
said using a probability statement. Let A be the event that the person chosen thinks that women
should not be enlisted into National Service, and B be the event that the person chosen is male.
Which of the following statements correctly represents what Chase said?

(A) P(B | A) = P(not B | A).


(B) P(B | not A) = P( not B | not A).
(C) P(A | B) = P( A | not B).
(D) None of the other statements.

13. A poll conducted before a local election gives a 95% confidence interval for the percentage of voters
who support candidate X as (54%, 60%). Based on the same poll result, which of the following can
potentially be a 99% confidence interval for the percentage of voters who support candidate X?

(A) (56%, 58%).


(B) (52%, 62%).
(C) (54%, 60%).

14. A researcher takes a random sample from Country X’s population to estimate its unemployment
rate. From the sample, the researcher obtains the 95% confidence interval for the population
unemployment rate to be between 0.18 and 0.22.
Which of the following statements correctly interprets the results?

(A) If many samples of the same size were collected using the same procedure, and their respective
confidence intervals calculated in the same way, about 95% of these samples will have the
sample unemployment rate lie between 0.18 and 0.22.
(B) If many samples of the same size were collected using the same procedure, and their respective
confidence intervals calculated in the same way, about 95% of these samples will have the
population unemployment rate lie between 0.18 and 0.22.
(C) If many samples of the same size were collected using the same procedure, and their respective
confidence intervals calculated in the same way, about 95% of these samples will have the
sample unemployment rate lie within the samples’ respective confidence intervals.
(D) If many samples of the same size were collected using the same procedure, and their respective
confidence intervals calculated in the same way, about 95% of these samples will have the
population unemployment rate lie within the samples’ respective confidence intervals.

15. A 95% confidence interval, constructed from a random sample, for the population mean number of
children per household in Country Z is (1.21, 4.67). Which of the following statements is/are true?
Select all that apply.

(A) We are 95% confident that the population mean number of children per household in Country
Z lies between 1.21 and 4.67.

(B) 95% of all samples of the same size and sampling procedure should have sample mean number
of children per household between 1.21 and 4.67.
(C) 95% of all households in Country Z have between 1.21 and 4.67 children.
(D) If we take 100 different samples of the same size using the same sampling procedure and
compute the confidence interval for each sample in the same way, approximately 95 of the
intervals will contain the true population mean.

16. Two different random samples (call them Sample 1 and Sample 2) of size 100 each were chosen
from a population of 10000 people. For Sample 1, the 95% confidence interval (call this Interval
1) for the population mean height was calculated. For Sample 2, the 99% confidence interval (call
this Interval 2) for the population mean height was calculated. Which of the following statements
is/are correct? Select all that apply.

(A) If the population mean height lies in Interval 2, then it must lie in Interval 1.
(B) The population mean height must lie in at least one of the two confidence intervals.
(C) It is possible that the population mean height does not lie in both Intervals 1 and 2.
(D) It is possible that the population mean height lies in both Intervals 1 and 2.

17. A random sample of size 500 is taken 100 times from the same population. Which of the following
statements is/are correct?

(I) If a 99% confidence interval of the population parameter is created for each of the 100 samples,
more than 90% of the intervals should contain the population parameter.
(II) A smaller sample is likely to give a narrower 99% confidence interval.

(A) Only (I).


(B) Only (II).
(C) Both (I) and (II).
(D) Neither (I) nor (II).

18. Edd wants to find the proportion of the recently graduated cohort of NUS students who have
obtained first-class honors. He first took a random sample X of 300 students from the graduating
cohort of 3000 students and calculated the proportion of sample X who have obtained first-class
honors. He then constructed a 95% confidence interval using sample X. He repeated this process
four more times, using the same sampling process, but with various sample sizes and confidence
levels. His results are shown in the table below.

Sample   Sample size   Confidence level   Confidence interval
X        300           95%                [0.025, 0.075]
A        300           95%                [0.023, 0.071]
B        300           95%                [0.018, 0.062]
C        200           95%                [0.020, 0.080]
D        100           99%                [0.041, 0.059]

Assume that the calculation for sample X’s confidence interval is correct. Edd’s friend claims
that Edd made errors in his calculations of the other confidence intervals. Which of the following
statements can be deduced from the information above?

(A) Edd made a miscalculation for sample A, since it has a similar confidence interval as that of
sample X.
(B) Edd made a miscalculation for sample B, since it has a very different confidence interval from
that of sample X.

(C) Edd made a miscalculation for sample C, since it has a wider confidence interval than that
from sample X.
(D) Edd made a miscalculation for sample D, since it has a narrower confidence interval from
sample X.

19. Suppose that we have a population of 10000 ten-year-old school children. Two random samples
are taken, using the same sampling method, from the population and the sex (boy or girl) of the
children in the samples are noted. Some information about the two samples is given below.

Sample size Sample proportion of boys


Sample 1 400 p1
Sample 2 100 p2

You are given that no child was selected in both samples and also that p1 > 0, p2 > 0.
Various confidence intervals for p, the population proportion of boys, are constructed. For any
confidence interval [x, y], we define the length of the confidence interval to be y − x. Details of the
three confidence intervals are given below:

• Confidence interval 1 (CI1): 90% confidence interval for p constructed using Sample 1.
• Confidence interval 2 (CI2): 95% confidence interval for p constructed using Sample 1.
• Confidence interval 3 (CI3): 95% confidence interval for p constructed using Sample 2.

Let the length of CI1 (respectively, CI2, CI3) be L1 (respectively, L2, L3). Which of the following
statements must be true? Select all that apply.

(A) L1 is always smaller than L2 .


(B) If p1 = p2 , then L3 = 2L2 .
(C) If p is contained inside CI2, then it must be contained inside CI1.
(D) If Sample 3 is obtained by combining Samples 1 and 2, then the proportion of boys in Sample 3 is (4p1 + p2)/5.
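For Question 19, it may help to recall that the usual normal-approximation interval for a proportion has half-width z × sqrt(p(1 − p)/n), with p the sample proportion, so its length shrinks like 1/sqrt(n) and grows with the confidence level. A small Python sketch makes the comparison concrete; the value 0.5 used for p1 and p2 is a placeholder, since the question only states that they are positive.

import math

def ci_length(p_hat, n, z):
    # length of the normal-approximation interval p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
    return 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)

p1 = p2 = 0.5                    # placeholder values, not given in the question
L1 = ci_length(p1, 400, 1.645)   # 90% interval from Sample 1
L2 = ci_length(p1, 400, 1.960)   # 95% interval from Sample 1
L3 = ci_length(p2, 100, 1.960)   # 95% interval from Sample 2
print(L1 < L2, L3 / L2)          # with p1 = p2, L3/L2 is 2 since sqrt(400/100) = 2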

20. Candy took a random sample from all the students in Toontown secondary school and admin-
istered an IQ test to all the 100 students in the sample. Using all the 100 IQ scores and the
confidence interval formula for population mean that she learnt from GEA1000, she constructed a
95% confidence interval for the population average IQ to be [99.34, 102.42]. Which of the following
statements is/are incorrect? Select all that apply.

(A) If many samples of 100 students were collected using the same procedure, and their respective
confidence intervals calculated in the same way, about 5% of these samples will not have the
average IQ of all Toontown secondary school students lie between 99.34 and 102.42.
(B) A 90% confidence interval constructed using Candy’s sample and the same confidence interval
formula that she used will be contained inside [99.34, 102.42].
(C) We are 95% confident that the IQ scores of all Toontown secondary school students lie below
102.42.
(D) If many samples of 100 students were collected using the same procedure, and their respective
confidence intervals calculated in the same way, about 5% of these samples will not have their
sample average IQs lie in their respective confidence intervals.

21. Using a random sample of 1000 Taylor Swift concert goers, the 95% confidence interval for the mean
number of friendship bracelets is found to be [2.46, 2.94]. The same sample is used to construct
the 99% confidence interval for the mean number of friendship bracelets, and the upper bound is
found to be 3.01. What is the lower bound of this 99% confidence interval?
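A hint for Question 21: a confidence interval for a mean constructed with the usual "sample mean ± margin of error" formula is symmetric about the sample mean, so the 95% and 99% intervals built from the same sample share the same midpoint. The two-line Python check below simply restates that symmetry; no special library is needed.

lower95, upper95 = 2.46, 2.94
centre = (lower95 + upper95) / 2     # the sample mean sits at the midpoint
upper99 = 3.01
lower99 = 2 * centre - upper99       # symmetry: centre minus (upper99 minus centre)
print(centre, lower99)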

22. A coin manufacturer claims that he has produced a biased coin with P (H) = 0.4 and P (T ) = 0.6,
where P (H) denotes the probability of the coin landing on heads and P (T ) denotes the probability
of the coin landing on tails. Out of 10 independent tosses, Brad observes 8 heads and 2 tails. Based
on these data, he decides to do a hypothesis test to see if there is enough evidence to reject the
manufacturer’s claim. Which of the following statements should he adopt as his null hypothesis?

(A) P (H) = 0.4.


(B) P (H) = 0.5.
(C) P (H) = 0.8.
(D) P (H) = 0.6.

23. A researcher is interested to know if smoking and heart disease are associated with each other in
the population of Singapore. The researcher carries out a census of the population with a 100%
response rate. The researcher conducts a chi-squared test on the census data at 5% significance
level and obtains a p-value of 0.001. Which of the following is a valid conclusion?

(A) Since p-value is less than 0.05, the null hypothesis is rejected at the 5% significance level,
and the researcher concludes that smoking and heart disease are associated with each other
in the population.
(B) Since p-value is less than 0.05, the null hypothesis is not rejected at the 5% significance
level, and the researcher concludes that smoking and heart disease are associated with each
other in the population.
(C) Since p-value is less than 0.05, the null hypothesis is not rejected at the 5% significance
level, and the researcher concludes that smoking and heart disease are not associated with
each other in the population.
(D) None of the other options is a valid conclusion.

24. Suppose we want to test if a coin is biased towards tails. We decide to toss the coin 10 times and
record the number of heads. We shall assume the independence of coin tosses, so that the 10 tosses
constitute a probability experiment.
Let X denote the number of heads occurring in 10 tosses of the coin. We will carry out a hypothesis
test with X as the test statistic. Let H be the event that the coin lands on heads, in a single toss.
We set our hypotheses to be

• H0 : P(H) = 0.5,
• H1 : P(H) < 0.5.

Suppose in our execution of the 10 tosses, we observe 7 heads. This means X = 7 is the test result
we observe.
Recall the definition of p-value to be the probability of obtaining a test result at least as extreme
as the one observed, assuming the null hypothesis is true. Which of the following is/are possible
test result(s) “at least as extreme as the one observed”, in this scenario? Select all that apply.

(A) X = 4.
(B) X = 6.
(C) X = 8.
(D) X = 10.
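Once the direction of "extreme" in Question 24 is settled, the p-value itself is a binomial tail probability computed under H0. The sketch below (Python, standard library only) shows how a left-tailed p-value P(X ≤ x_obs) would be evaluated for a generic observed count x_obs; it illustrates the definition rather than working the question.

from math import comb

def left_tail_pvalue(x_obs, n=10, p=0.5):
    # P(X <= x_obs) under H0: P(H) = 0.5, i.e. the results at least as extreme
    # as x_obs when the alternative hypothesis is P(H) < 0.5
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x_obs + 1))

print(left_tail_pvalue(7))  # left-tailed p-value when 7 heads are observed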

25. A group of students wants to find out if there is any association between staying in a hall and
being late for class in NUS in a particular month. If students are late for at least 5 classes, they
are considered “late for class” in that month. After collecting a random sample of 1000 students,
they found that 200 out of 350 students who stay in a hall are late for class, while 390 out of 650
students who do not stay in a hall are late for class.
A chi-squared test was done to test for association between staying in a hall and being late for
class at 5% level of significance. The p-value derived from the chi-squared test is 0.3809.
Which of the following statements is/are true? Select all that apply.

(A) There is a positive association between staying in a hall and being late at the sample level.
(B) There is a negative association between staying in a hall and being late at the sample level.
(C) Since the p-value is more than 0.05, we can conclude that there is an association between
staying in a hall and being late at the population level.
(D) Since the p-value is more than 0.05, we cannot conclude that there is an association between
staying in a hall and being late at the population level.
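The p-value quoted in Question 25 can be reproduced from the 2 × 2 table of counts. The Python sketch below computes the chi-squared statistic from the expected counts and converts it to a p-value using scipy.stats.chi2 (this assumes SciPy is installed; the figure 0.3809 corresponds to the test without a continuity correction).

from scipy.stats import chi2

# rows: stay in hall / do not stay in hall; columns: late / not late
observed = [[200, 150], [390, 260]]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

stat = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / total
        stat += (observed[i][j] - expected) ** 2 / expected

p_value = chi2.sf(stat, df=1)  # upper-tail probability with 1 degree of freedom
print(round(stat, 4), round(p_value, 4))  # p-value is about 0.38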

26. 25 mothers were each allowed to smell two articles of infants' clothing. Each of them was then asked to pick the one which belongs to her infant. They were successful in doing so 72% of the time.
You want to show that this has not happened by chance and mothers can indeed recognise the
smell of their children. To test such a hypothesis, what should the null and alternative hypotheses
be?

(A) H0 : P(Success)= 0.5; H1 : P(Success)> 0.5.


(B) H0 : P(Success)= 0.72; H1 : P(Success)> 0.72.
(C) H0 : P(Success)= 0.5; H1 : P(Success)< 0.5.
(D) H0 : P(Success)= 0.72; H1 : P(Success)< 0.72.

27. A hypothesis test is done to find out whether vaccine X prevents cancer in a population of dogs,
where cancer affects 10% of dogs. A random sample of 100 puppies was selected for the study. All
100 puppies received vaccine X and we observed them for their entire lifetimes. 5 of these puppies
eventually had cancer. The null hypothesis is

H0 : Vaccine X has no effect on cancer in the population.

Then the p-value is

(A) the probability that vaccine X is effective.


(B) the probability that the hypothesis will be rejected.
(C) the probability that 5 puppies out of 100 have cancer, given that the probability of cancer is
0.1.
(D) the probability that 5 or fewer puppies out of 100 have cancer, given that the probability of
cancer is 0.1.
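The tail probability described in option (D) of Question 27 can also be approximated by simulation, in the spirit of the simple simulations in this chapter. A minimal Python sketch, assuming the null model in which each of the 100 puppies independently develops cancer with probability 0.1:

import random

random.seed(3)  # fixed seed only for reproducibility of this illustration
reps, n, p0, x_obs = 100_000, 100, 0.1, 5
count = 0
for _ in range(reps):
    cancers = sum(random.random() < p0 for _ in range(n))
    if cancers <= x_obs:   # a result at least as extreme as 5 cases out of 100
        count += 1
print(count / reps)        # simulation estimate of P(X <= 5) under H0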

28. A fortune teller specialising in predicting the outcome of coin flips claims that he can successfully
predict the outcome of a coin flip 60% of the time. You suspect that he is actually more accurate
than what he claims to be and decide to do a hypothesis test. What should the null and alternative
hypotheses be?

(A) H0 : P (Success)= 0.5; H1 : P (Success)> 0.5.


(B) H0 : P (Success)= 0.5; H1 : P (Success)> 0.6.
(C) H0 : P (Success)= 0.6; H1 : P (Success)> 0.6.
(D) H0 : P (Success)= 0.6; H1 : P (Success)< 0.6.

29. A student conducts a one-sample t-test, at a 5% level of significance, in the following way:
Null hypothesis H0 : µ = 20.
Alternative hypothesis H1 : µ > 20.
Which of the following statements is true?

(A) If the level of significance was increased to 10%, the p-value will remain the same.
(B) If the p-value > 0.05, H0 should be rejected.
(C) If H1 was changed to µ < 20, the level of significance will decrease.
(D) If H0 was rejected at a 5% level of significance, H0 will also be rejected at a 1% level of
significance for the same data set.

30. Suppose we wish to understand if there is any significant association between sex and the type of
cuisine that individual students in a Humanities College choose whenever they visit their school
canteen. There are two types of cuisines – Western and Oriental. A random sample of 100 students
was taken from the Humanities College students, with the responses of their food choices recorded,
and a chi-square test subsequently done. The p-value obtained from that test is 0.07. Based on
the test findings alone, what can we deduce at the 5% level of significance?

(A) There is sufficient evidence at the 5% level of significance to show that sex is associated with
cuisine choice among students in the college.
(B) There is insufficient evidence at the 5% level of significance to show that sex is associated with
cuisine choice among students in the college.
(C) There is sufficient evidence at the 5% level of significance to show that sex is not associated
with cuisine choice among students in the college.
(D) There is insufficient evidence at the 5% level of significance to show that sex is not associated
with cuisine choice among students in the college.
Index

Y-intercept, 164

Association, 70, 71
  Moderate, 157
  Negative, 70, 155
  Positive, 70, 155
  Strong, 157
  Weak, 157
Associations, 151
Atomistic fallacy, 163

Bar plot
  Dodged, 65
  Stacked, 65
Base Rate Fallacy, 226
Bias, 4
  Non-response, 4
  Selection, 4
Blinding, 22
  Double, 22
  Single, 22
Boxplot, 148
  Whiskers, 148

Causation, 71
Cause-and-effect relationship, 19
Census, 4
Cluster Sampling, 6
Coefficient of variation, 15
Conditional independence, 223
Conditional Probability, 220
Confidence interval, 231
Confidence Intervals, 216
Confidence level, 231
Confounder, 83, 84
Conjunction Fallacy, 225
Contingency table, 67
Control group, 19
Controlled experiment, 19
Convenience Sampling, 7
Correlation coefficient, 152, 155

Density Curve, 229
Deterministic, 150
Distribution, 140
  Bimodal, 143
  Multimodal, 143
  Normal, 144
  Peaks, 142
  Skewed, 143
  Skewness, 142
  Symmetrical, 143
  Unimodal, 143

Ecological correlation, 161
Ecological fallacy, 163
estimate, 3
Event, 216
Experimental study, 19
Exploratory Data Analysis, 3, 15
  Univariate, 139
Exposure group, 24

Generalisability, 4, 24
Generalisability Criteria, 8
Gradient, 164

Histogram, 141
Hypothesis
  Alternate, 238
  Null, 238
  Test, 237
Hypothesis Test, 216
Hypothesis testing, 231

Independent, 223
Interquartile range, 10, 17

Law of Total Probability, 225
Linear Regression, 164

Margin of error, 233
Mean, 10, 144
Median, 10, 16, 144
Method of least squares, 165
Mode, 10, 18, 144

Normalisation, 69

Observational study, 23, 71
Ordered pair, 152
Outcomes, 216
Outlier, 145, 148

percentile, 17
Placebo, 21
Placebo effect, 21
Population, 1
Population parameter, 3
Probability, 216
  Experiment, 216
Proportion, 67
proportions, 13
Prosecutor's Fallacy, 222

Quartile
  First, 17
  Third, 17

Random assignment, 20
Random Variable, 228
  Continuous, 228
  Discrete, 228
Range, 145
Rate, 64
Rates
  Basic rule, 75
  Conditional, 68
  Joint, 68
  Marginal, 68
  Symmetry Rule, 72
Regression analysis, 152
Relationship
  Direction, 153
  Form, 153
  Strength, 154
Repeated sampling, 234
Research question, 1
Residual, 164
Robust statistics, 146
Rules of Probability, 218

Sample, 3
  Self-select, 7
Sample Space, 216
Sample variance, 13
Sampling
  frame, 3
  Non-probability, 4, 7
  Probability, 5
Sampling Without Replacement, 5
Scatter plot, 152
Sensitivity, 227
Significance Level, 238
Simple Random Sampling, 5
Simpson's Paradox, 81
Sliced stacked bar plot, 81
Slicing, 82
Slope, 164
Specificity, 227
Spread, 145
Standard Deviation, 10, 145
Standard Unit, 157
Statistical Inference, 215
Statistical inference, 230
Strata, 6
Stratified Sampling, 6
Stratum, 6
Summary Statistics, 10
Systematic Sampling, 5

Treatment group, 19
True Negative Rate, 227
True Positive Rate, 227

Uniform probability, 219

Variable, 8
  Categorical, 9
  Confounding, 84
  Continuous, 9
  Dependent, 8
  Discrete, 9
  Independent, 8
  Nominal, 9
  Numerical, 9
  Ordinal, 9
Volunteer Sampling, 7

Weighted average, 12
weights, 12
