Chapter 1 Part 1
2
3
Welcome to Quantitative Reasoning with Data. My name is Samuel Yeun, and in this first chapter,
my colleagues and I hope to start you off with a better appreciation of what this “Quantitative
Reasoning” is all about, and how understanding “data” can be meaningful to the world around you.
Increasingly, you may notice that many things are becoming data driven – data to predict market
direction, data from customer feedback to improve service quality, data to determine a vaccine’s
effectiveness, etc.
4
Let’s look at one such example from the news. A 2021 article states that there was a fall in Singapore
marriages amid the Covid-19 pandemic restrictions.
5
Along with the report, some data was provided as well. We can see that the number of marriages in
2020 is 22,651 – a 10.9% fall from 2019’s data. From this information, would you say the pandemic is
to blame for the drop in marriages? Maybe we cannot jump to conclusions yet! What about
Singapore’s population size? Has it also shrunk over the year? If so, wouldn’t that also play a part in
the fall in marriages? How did the study even obtain the number 22,651 in the first place? How do
we make sense of all these numbers and information together? That is where reasoning with data –
or “Quantitative Reasoning” comes into play.
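As a small illustration of reasoning with the quoted figures, the sketch below recovers the 2019 marriage count implied by the 10.9% fall. It also defines a hypothetical helper, marriages_per_1000, for comparing years once population sizes are known; the lecture gives no population figures, so none are supplied here.

```python
# Figures quoted in the article: 22,651 marriages in 2020, a 10.9% fall from 2019.
marriages_2020 = 22_651
fall = 0.109

# Since 2020 = 2019 * (1 - fall), the implied 2019 count is:
marriages_2019 = marriages_2020 / (1 - fall)
print(round(marriages_2019))  # about 25,422

def marriages_per_1000(marriages, population):
    """Marriages per 1,000 people (hypothetical helper for comparing years fairly)."""
    return 1000 * marriages / population
```

A rate per 1,000 people accounts for changes in population size, which the raw counts alone cannot do.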
6
Let’s start with the term “Population”. In the previous marriage example, we loosely used the phrase
“Singapore’s population” to describe the number of people living in Singapore. However, in the
world of statistics and data, the word “Population” has a slightly stricter meaning. We say that the
population is the entire group (of individuals or objects) that we want to know something about.
7
This ties in nicely with the idea of a “research question”. The research question is usually one that
seeks to investigate some characteristic of a population.
For example, we might want to know the answer to questions such as:
What percentage of Singapore adults owns a car? (Population: Singapore adults)
Does Brand X pesticide work against mosquitoes? (Population: Mosquitoes)
Do NUS students who take notes using pen and paper score better than those using laptops?
(Population: NUS students)
8
In general, there are 3 types of research questions.
The first kind are questions that make an estimate about the population. The estimate is often about
an average value or a proportion with a given characteristic. For example: What is
the average number of hours that students study each week?
9
The second kind are questions that test a claim about the population. The claim is also often about
an average value or a proportion with a given characteristic. For example: Does the majority of
students qualify for student loans?
10
The last kind are questions that compare two sub-populations, or investigate a relationship between
two variables in the population. We usually compare population average values or proportions with a
given characteristic. For example: Among the population of students, are student
athletes more likely than non-athletes to do final year projects?
11
To summarize, we discussed how in a data oriented world, it is important to first know the
population of interest. We also covered how a research question is linked to the population of
interest, and how there are 3 different types of research questions.
12
Hi everyone, my name is Timothy. In this unit, we will be covering an important aspect of
quantitative reasoning, and that is Sampling.
13
14
15
In the complicated world that we live in, data about individuals often consists of long lists of
information. For example, if we wish to know how a pandemic, say COVID-19, is related to
other variables, there are many possible independent variables that may have considerable influence
on the rise in COVID-19 cases in a country.
To make sense of that long list of information, we adopt a systematic process
called exploratory data analysis (EDA). EDA is a useful iterative process in which we explore
the raw data and summarize it using graphs and numerical measures, such as percentages
or averages. We then evaluate the usefulness of the existing data and tweak our data
analysis, until we have information good enough to answer the questions we care about.
What are some steps that we need to take note of? Firstly, we generate questions that we could
think of answering with the existing data set. However, prior to this step, it is possible that we might
already have a question we wish to answer using the given data.
We then try to understand the data by using data visualization (histograms, scatter plots and line
graphs, for example) to observe key trends. In the process of data exploration, we could also do
some data modelling by using linear regression (to be covered later).
Finally, we ask ourselves: “To what extent does the data help to answer our existing
questions? Does it lead us to think about other questions that are related to the data set given?”
With that, we either refine our existing questions or generate new ones.
As the exploration process continues, very often some ideas will take off, while others will end with
no conclusion. However, the eventual outcome is to have a few particularly useful questions
answered.
16
The EDA process does not stop in this unit – in the rest of the module, we will be covering in greater
detail different aspects of EDA.
In this chapter, we will look at the types of variables, be it categorical or numerical data.
Further on, we will also look specifically at categorical variables and how to identify association
between the variables.
Also, we will deal with correlation between numerical variables. We will understand how to visualize
the given data by using scatter plots, histograms and box plots.
17
A Population of Interest refers to the group about which the researcher wishes to draw conclusions
in a study. It could be “the population of Asia”, “the population of Singapore”, “the population of
Ang Mo Kio”, or the population of “residents living in Block 123, Happiness Avenue”.
A population parameter is a numerical fact about a population. Note that population parameters are
constants.
Very often, this group has many individuals, and it may not be practical for us to gather
information from every individual in the population. That is where we gather a sample, which is a
portion of the population selected for the study.
In such cases, even though we do not know the parameter, the sample hopefully provides us with a
reasonable estimate about the parameter. The estimate is an inference about the population’s
parameter, based on information obtained from the sample.
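The distinction between a parameter and an estimate can be sketched in a few lines of Python. The population below is entirely made up (random values standing in for cups of coffee per day), and the sizes 10,000 and 500 are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: daily cups of coffee for 10,000 residents (made-up values).
population = [random.choice([0, 1, 2, 3, 4]) for _ in range(10_000)]

# Population parameter: a fixed number, usually unknown in practice.
parameter = statistics.mean(population)

# A sample of 500 units drawn at random; its mean is the estimate.
sample = random.sample(population, 500)
estimate = statistics.mean(sample)

print(f"parameter = {parameter:.3f}, estimate = {estimate:.3f}")
```

The parameter stays the same however many samples we draw, while the estimate changes from sample to sample.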
18
Before we take a sample, we need to understand the source from which we draw our sample.
We call this the Sampling Frame.
For example, if I wish to find out the average number of cups of coffee individual Singapore
residents drink in a day, the population of interest would be “Singapore residents”. Suppose I use
“handphone numbers” as a sampling frame. The problem is two-fold: 1. Not all Singapore residents
own a local number (some foreigners own foreign phone numbers, while some locals do not own a
local phone number); 2. Some residents own more than 1 line. This shows that a
sampling frame may fail to cover the whole population of interest, or may contain redundant
entries.
A natural question to ask about the sampling process is whether the conclusions/observations
derived from the sample can be said of the target population as well. For example, are we able to
use the average number of cups of coffee drunk by individuals in a sample of, say, 500 individuals to
describe the average coffee drinking behaviour of all Singapore residents? In other words, is the sample
data generalisable to a larger group of individuals, preferably the target population?
In the case of the sampling frame, one of the criteria for generalisability is that the sampling frame
must be equal to, or larger than, the target population. If the sampling frame fails to cover any
member of the target population, any sample that is taken from the sampling frame cannot be used
to generalise fully to the target population, as there exist members of the population that have
been left out.
It is important to note that the sampling frame is an important criterion for generalisability, but not
the only one. We will discuss the other determinants in the later parts of this unit.
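The coverage check can be sketched with Python sets. All identifiers below are hypothetical: five residents labelled A to E form the target population, and the frame is a list of phone lines in which one resident is missing, one appears twice, and one line belongs to someone outside the target population.

```python
# Hypothetical IDs; none of these come from the lecture.
target_population = {"A", "B", "C", "D", "E"}

# A frame listing phone lines: resident E has no local line, resident B has two,
# and "X" is a line owned by someone outside the target population.
frame_lines = ["A", "B", "B", "C", "D", "X"]

frame_units = set(frame_lines)
not_covered = target_population - frame_units       # left out of the frame
outside_target = frame_units - target_population    # in the frame, not in the target
duplicated = {u for u in frame_units if frame_lines.count(u) > 1}

# Generalisability needs the frame to cover every member of the target population.
covers_target = target_population <= frame_units
print(not_covered, outside_target, duplicated, covers_target)
```

Here the frame fails the coverage criterion because resident E can never be selected.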
19
When we attempt to obtain data from all members of the population of interest, within the confines
of the sampling frame, we call that a census. For example, suppose you wish to find out the starting
salary of NUS graduates. A census attempt means that one has to gather information from
all graduates, past and present. Do note that a census attempt may not achieve a 100%
response rate.
On the other hand, a sample is a selected portion of the population of interest. It is usually
used when census data is not readily available.
Why might a sample be preferred over collecting data from the entire population? There are a few
good reasons. Firstly, a sample is less costly administratively, because gathering and making sense of
a huge amount of information takes more resources.
Also, collection and processing of data is significantly faster for a sample than for a census.
Agencies keen on making sense of the data and releasing findings may prefer to collect a sample, so
that they can do so in the shortest possible time.
20
Before we proceed on to discuss the types of sampling processes, it is important to pause for a
moment to identify the types of biases and errors associated with it. Here we wish to highlight two
important types of bias: selection bias and non-response bias.
Selection bias can arise from an imperfect sampling frame. Individuals could be unintentionally or
deliberately left out because the sampling frame of the study has excluded them, and that means
that the results of the study will be biased towards those who are in the sampling frame. For
example, suppose a study wishes to understand the satisfaction of road users in Singapore, be it
motorcyclists, car drivers, etc. One possible sampling frame is the list of all car and motorcycle
plate numbers, with data gathered from the owners of these vehicles. Which group of individuals
may be left out? We could foresee that road users who do not own a vehicle,
but use another owner’s car or motorcycle, will be left out of the study.
Another source of selection bias is the use of non-probability sampling. In essence, a
non-probability sampling method does not involve the use of chance in the selection of individuals,
and this contributes significantly to selection bias. We will delve into it in the later parts of this
unit.
In addition, we need to watch out for non-response bias, which is bias arising from selected
individuals not responding to the study. Generally, individuals may choose not to respond
because they are not interested, because responding to the survey or study is inconvenient,
or because they are unwilling to disclose information that is important to the study due
to the sensitive nature of the study. For example, an individual who chooses not to respond to a
survey on household gambling may do so for fear of disclosing embarrassing information about his
or her family members’ gambling habits. Such non-response may distort our estimate of the
true population parameter. It is noteworthy that non-response bias may occur
regardless of whether the sampling method is probability or non-probability.
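A small arithmetic sketch shows how non-response can distort an estimate. Every number below is hypothetical: 1,000 selected individuals, 300 of them from gambling households, and assumed response rates of 40% and 80% for the two groups.

```python
# Hypothetical numbers for illustration only.
selected = 1000                     # people selected into the sample
gamblers = 300                      # selected people from gambling households
non_gamblers = selected - gamblers

# Assumed response rates: gambling households respond less often.
response_rate_gamblers = 0.40
response_rate_non_gamblers = 0.80

resp_gamblers = gamblers * response_rate_gamblers                # about 120 respond
resp_non_gamblers = non_gamblers * response_rate_non_gamblers    # about 560 respond

true_proportion = gamblers / selected
observed_proportion = resp_gamblers / (resp_gamblers + resp_non_gamblers)

print(f"true: {true_proportion:.3f}, observed among respondents: {observed_proportion:.3f}")
```

Under these assumptions the respondents understate the true proportion (about 0.18 observed against 0.30), even though the original selection was unbiased.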
21
22
Probability sampling refers to a sampling process whereby one uses a known randomised
mechanism. Every unit in the population has a non-zero and known probability of being selected.
Note: the probability of an individual in the population being selected may not necessarily be the
same. The general idea is to eliminate biases associated with human selection by using a
randomisation technique. The four kinds of probability sampling methods that we will be covering in
this course are: Simple Random Sampling, Systematic Random Sampling, Stratified Sampling and
Cluster Sampling.
23
Simple random selection with replacement guarantees that the sample results do not change
haphazardly from sample to sample. The variability in the sample results is due to chance.
One mechanism for conducting simple random sampling is to assign a number to each subject and
use a random number generator to pick the numbers to be included.
One application of simple random sampling is national polls, that is, surveys of
public opinion. For example, a country’s mask wearing policies may be a topic of interest, and we
may want to find out the public's opinion on them. We may not have time to survey everyone in the
country, but one way to get a sample is via the process of "random-digit dialing".
The chance of any particular group of units being selected in a random sample from a
large population is quite small, but whatever group is selected is likely to be representative. For
example, in a simple random sample drawn from the 5.6 million people in Singapore, it is quite
unlikely that both you and your classmate in NUS will be selected. However, for a population with
diverse opinions on an issue, it is extremely likely that each opinion is represented by someone in
the sample who holds a similar view.
24
Samples derived through Simple Random Sampling tend to represent the population well, as
the sample variations are primarily due to chance. However, while generating a sample with a number
generator is easy with the use of technology, reaching the selected individuals
could be a problem. Accessibility could be problematic, as the selected individuals may be located in
different geographical locations, and even if these individuals could be contacted, they may choose
to opt out of the study, posing a problem of non-response.
25
To ensure that the probability of selection remains the same for every individual on every draw, we
select individuals at random with replacement. That means that there is a non-zero chance that a unit,
say number 5, is selected more than once. However, with a larger population and a sample that is small
in proportion to the population size, the possibility of repeated selection becomes minimal.
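The effect of population size on repeated selection can be checked with a short calculation. The helper name prob_no_repeat and the population sizes tried are assumptions made for this sketch; the draw itself uses random.choices, which samples with replacement.

```python
import random

def prob_no_repeat(population_size, sample_size):
    """Chance that sample_size draws with replacement are all distinct units."""
    p = 1.0
    for i in range(sample_size):
        p *= (population_size - i) / population_size
    return p

# Same sample size, increasingly large populations: repeats become rarer.
for N in (100, 10_000, 1_000_000):
    print(N, round(prob_no_repeat(N, 10), 4))

# Drawing with replacement: random.choices may return the same unit more than once.
random.seed(0)
sample = random.choices(range(1, 101), k=10)
print(sample)
```

With 10 draws, the chance of no repeat rises towards 1 as the population grows, which is why repeats matter little when the sample is a small fraction of the population.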
26
Systematic sampling is a method of selecting units from a list through the application of a selection
interval, K, so that every Kth unit on the list, following a random start, is included in the sample.
27
Example: Suppose there are 110 sampling units in a population of, say, dormitory residents. Each of
the 110 residents is given one unique number from 1 to 110. A study requires us to select a sample
of 10 units, so the selection interval is 110/10 = 11. A random starting number is then selected from
1 to 11. If the selected number is 6, then units 6, 17, 28, ..., 105 are selected to form the sample.
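The dormitory example can be reproduced with a short function. This is a minimal sketch: the name systematic_sample and its start parameter are assumptions for illustration, and the interval is taken as the whole-number part of N/n.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Select every k-th unit after a random start in 1..k, where k = N // n."""
    k = population_size // sample_size        # selection interval
    if start is None:
        start = random.randint(1, k)          # random start, both ends inclusive
    return list(range(start, population_size + 1, k))[:sample_size]

print(systematic_sample(110, 10, start=6))
# [6, 17, 28, 39, 50, 61, 72, 83, 94, 105]
```

Calling systematic_sample(110, 10) without a start picks the starting unit at random, as in the procedure described above.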
28
The advantage of this process is that the selection process is much more straightforward than a
simple random sample – in the mentioned example, one can comb through dormitory number by
dormitory number to pick a sample.
Another advantage is that for systematic sampling, we may not need to know the exact population
size at the planning stage. If we have a rough estimate of the number of dormitory residents in the
population, our systematic sample can still be treated like a simple random sample,
provided the numbering of the sampling units is done randomly, because it is then unlikely that the
order of the sampling units is associated with the characteristic of interest. Back to our dormitory
example, if we did not know exactly how many dormitory residents there were, because the nature of
their jobs means residents come and go, we may not be able to run a simple random sample. In such
cases, a systematic sample is a practical alternative.
However, we need to be mindful of the possibility that a systematic sample may not be truly
representative of the population if the sampling list is non-random. Suppose the ordering of
dormitory residents is not random, in the sense that, say, residents of a
particular nationality prefer to stay at a particular location, or prefer particular numbers over
others. We may then end up choosing individuals with specific characteristics more often
than others, so that the sample over-represents some groups and under-represents others. To
illustrate this problem further, suppose that we have 11 different nationalities with 10 people from
each nationality in this dormitory, and that the numbers 1 to 110 are assigned by cycling through
Nationalities 1 to 11 in order: resident 1 is from Nationality 1, resident 2 from Nationality 2, and so
on up to resident 11 from Nationality 11, after which resident 12 is again from Nationality 1. With
the above selection, we end up with a sample consisting only of members of Nationality 6, as numbers
6, 17, 28, and so on are all individuals from Nationality 6. Therefore, the sample is not
representative, as no nationality other than Nationality 6 is included.
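The problem can be checked directly. The sketch below assumes the numbering cycles through the 11 nationalities (resident 1 is Nationality 1, resident 11 is Nationality 11, resident 12 is Nationality 1 again), so that the list order repeats with the same period as the selection interval; the function name nationality is hypothetical.

```python
# Assumed labelling: residents are numbered by cycling through the 11 nationalities,
# so resident 1 is Nationality 1, ..., resident 11 is Nationality 11,
# resident 12 is Nationality 1 again, and so on.
def nationality(resident_number, n_nationalities=11):
    return (resident_number - 1) % n_nationalities + 1

selected = list(range(6, 111, 11))           # units 6, 17, 28, ..., 105
print({nationality(r) for r in selected})    # {6}
```

Because the selection interval (11) matches the period of the list, every selected unit falls on the same nationality.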
29
Next, in the case of a Stratified sampling process, the population is divided into groups called strata.
The strata are chosen so that similar cases are grouped together. Note that the size of each stratum
may not necessarily be the same. Once stratification is done, simple random sampling is employed
within each stratum.
An example of a stratified sample is the sample count during the Singapore general
elections. Typically in an election, the time taken to tabulate the total votes cast within a
constituency or a county could be hours or even days. Therefore, voters and party candidates
would like a preliminary estimate, and a sample count provides one. A sample count can work
as follows: at every polling station within a constituency, polling officials draw a simple random
sample of 100 ballot papers, and the station results are combined as a weighted average, with each
station weighted by the proportion of the constituency’s voters who cast their
votes at that polling station.
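The weighted average in a sample count can be sketched as follows. The three polling stations, their voter numbers and the sampled ballot counts are all made up for illustration; only the sample size of 100 ballots per station comes from the description above.

```python
# Hypothetical polling stations (strata) in one constituency.
# voters: number of voters assigned to the station
# sample_for_a: ballots for Candidate A among 100 randomly drawn ballot papers
stations = [
    {"voters": 3000, "sample_for_a": 62},
    {"voters": 5000, "sample_for_a": 48},
    {"voters": 2000, "sample_for_a": 55},
]
SAMPLE_SIZE = 100
total_voters = sum(s["voters"] for s in stations)

# Weight each station's sample proportion by its share of the constituency's voters.
estimate = sum(
    (s["voters"] / total_voters) * (s["sample_for_a"] / SAMPLE_SIZE)
    for s in stations
)
print(f"Estimated vote share for Candidate A: {estimate:.1%}")  # 53.6%
```

A plain average of the three station proportions would give 55%, which ignores the fact that the largest station leans the other way.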
30
With simple random sampling done within every stratum, and proper weights assigned to every
stratum, stratified sampling is very useful when the cases in each stratum are very similar with
respect to the outcome of interest. However, the process may be complicated and
time-consuming. One needs information about the sampling frame and the strata, which in some cases
may be difficult to define.
31
On the other hand, in a cluster sample, we break up the population into many groups,
called clusters. We then take a random sample of a fixed number of
clusters, and include all observations from each of those clusters in the sample.
Suppose you wish to conduct a study on mental wellness among students in a country. One way to
do this is to adopt a cluster sampling approach, treating each school as a 'cluster'. We break the
population of interest into the respective schools, take a random sample of a fixed number of
schools, and include all observations from the selected clusters by extending the study to all
students within each selected school.
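The two steps of cluster sampling can be sketched in Python. The eight schools and their student IDs below are hypothetical, as is the choice of three clusters.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical schools (clusters), each with a list of student IDs.
schools = {
    f"School {i}": [f"S{i}-{j}" for j in range(1, 6)]
    for i in range(1, 9)
}

# Step 1: randomly select a fixed number of clusters.
chosen_schools = random.sample(sorted(schools), k=3)

# Step 2: include every student from each selected cluster.
sample = [student for school in chosen_schools for student in schools[school]]
print(chosen_schools, len(sample))
```

Chance enters only in the choice of schools; within a chosen school, nobody is left out.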
32
An advantage of cluster sampling is that it is less costly and time-consuming than
other sampling methods, like the stratified sampling approach. However, cluster sampling may not
work very well when the number of clusters is very small, or when the clusters are not similar to
each other. In such situations, the sample results may have high variability.
33
34
As opposed to probability sampling, non-probability sampling does not involve the use of chance in
the selection of individuals.
There are variants of non-probability sampling. Some examples include: Quota Sampling,
Convenience Sampling, Judgment Sampling and Volunteer Sampling. We are going to look at two
examples: one relating to convenience sampling, and the other relating to volunteer sampling. Do
bear in mind that these sampling methods are not mutually exclusive, in the sense that a sampling
method could well comprise elements of, say, both volunteer sampling and convenience
sampling.
35
A convenience sample is a non-probability sample in which the researcher uses the subjects that are
nearest and available to participate in the research study.
A good example of convenience sampling is surveys at shopping malls. In one recent study on
the relationship between mask mandates and compliance with COVID-19 safe management practices
in the United States, the study was done by picking 36 shopping malls in the state of Wisconsin. One
of the problems associated with such surveys is selection bias: individuals who frequent the malls
have a higher chance of being selected than those who do not visit them. We can think of
some examples of mall goers, such as teenagers, retired people, and individuals with higher income
and wealth. On the other hand, individuals who are less affluent, working
professionals, or individuals who prefer to shop online have a higher chance of being left
out. So there is a bias in selection.
A second problem is that individuals asked to participate in the study may opt out,
due to the inconvenience of the study, or simply because they do not wish to express
their views on the subject matter. This results in non-response bias.
36
The American Family Association (AFA) is a conservative group that favours heterosexual marriage.
In 2004, the AFA began a campaign advocating the constitutional amendment to define marriage
strictly as heterosexual marriage.
The group posted a poll asking for AFA members to voice their opinion, hoping to gain traction in
their movement. The result? Almost 850,000 people responded to the poll, and a whopping 60%
favoured same-sex marriage legalisation!
Why? What happened was that the link to the poll appeared in blogs, social media websites and
email listings that are connected to other communities with differing viewpoints. And some of these
groups with stronger opinions “volunteered” to participate – against the wishes of AFA!
We will leave you to think about the following in the context of this example:
Where is the selection bias in this case? And to what extent does non-response bias influence
the accuracy of the study?
37
To conclude, in general, when approaching sampling we should adopt the following process:
38
In conclusion, we wish to use a sample to say something about the population, but when can sample
statistics be used to estimate the population parameter? A few key conditions must be met:
As mentioned previously, the sampling frame must be larger than or equal to the target population.
The key principle is that members of the target population must not be left out.
Also, we learnt that a probability-based sample needs to be used to minimise selection bias.
A large sample size tends to reduce the variability of the sample estimate. This helps to reduce the
amount of ‘error’ in our sample’s estimate. More about this concept of ‘error’ will be covered later on.
Finally, in order for us to generalise sample findings to the population of interest, non-response must
be kept to the lowest possible proportion. If the non-response rate is high, generalisability to the
population of interest is difficult. At best, we can only generalise the sample findings to the
portion of the population of interest who would have responded to the study.
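The point about sample size can be illustrated with a small simulation. The population values are made up (drawn from a normal distribution with mean 50 and standard deviation 10), and the helper spread_of_sample_means is a hypothetical name; the sketch repeatedly draws random samples and measures how much the sample mean varies.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable
population = [random.gauss(50, 10) for _ in range(20_000)]  # made-up values

def spread_of_sample_means(n, repeats=300):
    """Standard deviation of the sample mean across repeated random samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

for n in (10, 100, 1000):
    print(n, round(spread_of_sample_means(n), 2))
```

The printed spread should shrink as n grows, which is the sense in which a larger sample reduces the 'error' in the estimate.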
39
Time for a quick summary. In this unit, we have covered the Exploratory Data Analysis
framework. We have understood that the process of data analysis is not static but
dynamic: we keep exploring to find out which key questions the dataset can actually help us
answer. Then we talked about what exactly a population, a sample, and
a sampling frame are. We also understood that non-response bias and selection
bias can occur in the sampling process. Thereafter, we talked about the
various kinds of probability sampling, namely simple random sampling, systematic random
sampling, stratified sampling, and cluster sampling. We also discussed different
scenarios of non-probability sampling, and saw how non-response
bias and selection bias could be very much active in the two examples that we
mentioned.
We ended off by talking about the main criteria for generalisability. We
hope that this video has given you useful information and shed some light on the
importance of sampling and of having the right sampling frame, so that we can make sense of the
world that we live in by using the sampling process.
40