BSDDDM Study Guide v2.0-2

Uploaded by

Jerlin Preethi
Business Statistics and Data-Driven Decision Making


STUDY GUIDE
v2.0
BUSINESS STATISTICS AND DATA-DRIVEN DECISION MAKING

Copyright © 2019 Kaplan Singapore. All rights reserved.


KHE-LCD-SGD-00039 i

Kaplan Desired Graduate Attributes

Through the reading of this module, Kaplan Singapore intends to:

• Instill in students the value of lifelong and self-directed learning by stimulating intellectual curiosity, creative and critical thinking and an awareness of cultural diversity;
• Assist students in developing professional attributes, ethical values, social skills and strategies that will nurture success in both their professional and personal lives;
• Foster integrity, commitment, responsibility and a sense of service to the community;
• Prepare students to meet the ever-changing needs of their communities both now and in the future; and
• Promote innovative and effective teaching.

Culminating from these institutional values and educational goals, Kaplan Singapore’s Desired Graduate Attributes are:

Inquiry and criticality: Graduates will be able to critically collect, evaluate and apply information and data in order to make decisions in a wide variety of professional situations. This attribute is demonstrated when students:

• Undertake, evaluate and apply appropriate research, theories, concepts and tools to investigate problems and find solutions;
• Exercise critical thinking and independent judgement to assess situations and determine solutions; and
• Have an informed respect for the principles, methods, values and boundaries of their profession and the capacity to question these.

Ethicality and discernment: Graduates will be able to assess situations and respond in an ethically, socially and professionally responsible manner. This attribute is demonstrated when students:

• Act responsibly, ethically and with integrity in their profession;
• Hold personal values and beliefs and participate in the broad discussion of these values and beliefs while respecting the views of others;
• Understand the broad local and global economic, political, social and environmental systems and their impact as appropriate to their discipline and profession; and
• Acknowledge personal responsibility for their own judgments and behaviour.

Ability to communicate well: Graduates will recognise the importance and value of communication in the learning and professional environment. This attribute is demonstrated when students:

• Create and present knowledge, arguments and ideas confidently and effectively using a variety of methods and technologies;
• Recognise the wide range of possible audiences for information and respond with communication strategies appropriate to those audiences; and
• Work collaboratively with people from diverse backgrounds and be aware of the different roles of team members and to function within that team.

Independent and reflective practitioner

• Graduates will be able to work independently and be self-directed learners with the capacity and motivation for continued professional learning and development; and
• They will be able to critically reflect on their own practice and evaluate and understand current capacity and further development needs.

Embedded within the desired graduate attributes are the following skills:

• Conduct research.
• Analyse, organise and present data and information.
• Think and read critically.
• Make an oral presentation.
• Intellectual curiosity and awareness of culture and diversity.
• Develop professional ethos and practice that will foster success in career and life.
• Meet the ever-changing needs of communities now and in the future.


Table of Contents

Message to Student
Kaplan Desired Graduate Attributes
Table of Contents
About this module
Instructions to Students
Scheme of Work
Assessment Matters

Topic 1: Present Data
Topic 2: Numerical Measures
Topic 3: Probability
Topic 4: Discrete Probability Distributions
Topic 5: Normal Distributions
Topic 6: Sampling Distribution
Topic 7: Confidence Interval
Topic 8: Hypothesis Testing on Population Mean
Topic 9: Hypothesis Testing on Difference of Two Population Means
Topic 10: Simple Linear Regression


About this module

Businesses deal with large amounts of data to enable decisions to be made. It is important that data is collected in a valid manner and that decisions are based on sound statistical principles. Thus there is a need to be aware of proper sampling techniques, the rules of probability and the fact that any decision made with incomplete information is prone to involve error. This module will introduce a variety of statistical techniques and show under which circumstances each should be used.

In this module, students will collect, analyse, present and summarise data to facilitate decision-making. Students will also examine how decision making with incomplete information can be mitigated through hypothesis testing, using the rules of probability and probability distributions. Tests, which allow comparisons between groups and model building with multiple predictors, will be introduced.

Module Learning Outcomes

Upon successful completion of this module, the student should be able to:

• Compute probabilities using basic principles.
• Determine probability using appropriate discrete and continuous probability distributions.
• Conduct statistical inference through estimation and hypothesis test.
• Perform simple linear regression analysis.
• Present data and findings in appropriate tables, charts and numerical measures such that decision-making may be improved.

Overview of Learning Resources

Recommended reading:

Weiss, N. (2015) Introductory Statistics: International Edition (10th Ed). Pearson Education Inc.

Berenson, Levine, & Krehbiel (2011) Basic Business Statistics – Concepts and Applications 12th Ed. Pearson Education Inc.

Other sources:

See Proquest and Newslink databases linked to your Elearn LMS homepage. The National Library Board on North Bridge Road (databases are for Singaporean/PR only)


Instructions to Students

How to use this study guide

This study guide consists of written notes that form the main treatise of the subject matter of this module. You are strongly advised to study these notes carefully and thoroughly, as well as examine the sources that have been cited.

Written quiz and examination will not test beyond the scope of the contents found in the study guide. However, in order to fully address the assessment requirements of the assignment, you will need to research beyond the confines of the study guide. Nevertheless, the materials herein are still a sound basis from which to build the assignment.

Further supporting materials

The study guide is supplemented by the following:

• Reproduced PowerPoint slides used by the lecturers
• Activity sheets

PowerPoint Slides

The PowerPoint slides are meant for the lecturers to signpost the flow of the lesson and for you to have a visual focus when in class. Outside of class, they can also serve to help you recall the activities that took place during the respective lessons so that you might be reminded of key learning points.

However, the PowerPoint slides must NOT replace the need for you to read the written notes in the study guide. The slides alone are INSUFFICIENT for you to gain the necessary understanding of the subject matter. As such, they will NOT prepare you adequately for the various summative assessment components.

Activity Sheets

It is imperative that you sincerely attempt all the activities in class and document your responses faithfully. These activity sheets are specially designed to scaffold your learning; working through the tasks is an integral part of developing the desired skills.

Also, by making your thinking visible through the activity sheets, it is then possible for your lecturer to provide you with growth-producing feedback so that you may improve your performance or have your doubts clarified.


Scheme of Work

SESSION   TOPICS (FT)

1    Topic 01: Present Data
     • Basic Concepts
     • Data Collection
     • Present Qualitative Data
     • Present Numerical Data

2    Topic 02: Numerical Measures
     • Measures of Central Tendency
     • Measures of Variation

3    Topic 03: Probability
     • Basic Principles of Probability
     • General Addition Rules
     • Conditional Probability

4    Practice – Extra Questions
     Recap topics 1-3

5    Topic 04: Discrete Probability Distributions
     • General Discrete Probability Distribution
     • Binomial Distribution
     • Poisson Distribution

6    Topic 05: Normal Distributions
     • Standard Normal
     • Probability of Normal Distribution
     Recap of topics 1-5

7    Topic 06: Sampling Distribution
     • Sampling Distribution of Population Mean
     • Sampling Distribution of Population Proportion

8    Topic 07: Confidence Interval
     • Overview of Confidence Interval
     • Confidence Interval of Population Mean when σ is known
     • Confidence Interval of Population Mean when σ is unknown

9    Exam Preparation – Practice Questions

10   Topic 08: Hypothesis Testing on Population Mean
     • Standard Costing
     • Variance Analysis

11   Topic 09: Hypothesis Testing on Difference of Two Population Means
     • Paired t-Test
     • Independent Z-test
     • Pooled t-Test
     • Non-pooled t-Test

12   Topic 10: Simple Linear Regression
     • Regression Analysis
     • Correlation Analysis

13   Recap topics 6-10
     Practice – Extra Questions

14   Exam Briefing
     Module Consolidation


Scheme of Work

SESSION   TOPICS (PT)

1    Topic 01: Present Data
     • Basic Concepts
     • Data Collection
     • Present Qualitative Data
     • Present Numerical Data

     Topic 02: Numerical Measures
     • Measures of Central Tendency
     • Measures of Variation

2    Topic 03: Probability
     • Basic Principles of Probability
     • General Addition Rules
     • Conditional Probability

     Topic 04: Discrete Probability Distributions
     • General Discrete Probability Distribution
     • Binomial Distribution
     • Poisson Distribution

3    Topic 05: Normal Distributions
     • Standard Normal
     • Probability of Normal Distribution
     Recap of topics 1-5
     Practice – Extra Questions

4    Topic 06: Sampling Distribution
     • Sampling Distribution of Population Mean
     • Sampling Distribution of Population Proportion

     Topic 07: Confidence Interval
     • Overview of Confidence Interval
     • Confidence Interval of Population Mean when σ is known
     • Confidence Interval of Population Mean when σ is unknown

5    Topic 08: Hypothesis Testing on Population Mean
     • Standard Costing
     • Variance Analysis
     Exam Preparation – Practice Questions

6    Topic 09: Hypothesis Testing on Difference of Two Population Means
     • Paired t-Test
     • Independent Z-test
     • Pooled t-Test
     • Non-pooled t-Test

     Topic 10: Simple Linear Regression
     • Regression Analysis
     • Correlation Analysis

7    Recap topics 6-10
     Exam Briefing
     Practice – Extra Questions
     Module Consolidation


Assessment Matters

Assessment Overview

Assessment 1: Continuous Assessment (Quiz)
Weighting: 20%
Date: To be confirmed
Duration: 10 minutes per quiz
Test Format: 5 MCQs per topic

Assessment 2: Examination
Weighting: 80%
Date: To be confirmed
Duration: 2 hours
Exam Format: Module Specific

Important Policies

Penalties for Plagiarism

Plagiarism in any form is not tolerated by Kaplan Singapore. That said, direct quotations and general similarities of common terms and language mean the E-Learn LMS will often pick up every small similarity, so a Turnitin Similarity report recording a result of 0% is unrealistic. After all, no technology is perfect and there is the need for some direct quotation (provided you reference using APA guidelines, of course) and to use commonly accepted terms and language.

TOP TIP:
The surest way to succeed is to ensure all work is correctly referenced. Keep a copy of the Kaplan Singapore Academic Works and APA Guide handy when you are typing your assignments and use it to guide you as to correct referencing, citation and other aspects of academic writing.


Penalties for late submissions

Kaplan Singapore prepares students for the realities of the workforce and further education by requiring students to meet deadlines and submit all work on time. As such, students are required to seek approval and penalties will be imposed on late assignment submissions in accordance with the table below and cited in the Programme Handbook:

No of days late    Penalty
1 – 5 days         10% deduction per day from the marks attained by students.
After 5 days       Assignments that are submitted more than 5 days after the due date will not be accepted and will be deemed as “No Submission”. Student will be required to re-module.

Assignments and Kaplan Learning Management System

Kaplan Singapore School of Diploma Studies requires you to submit Assignments through the Learning Management System (E-Learn LMS). When submitted, your assignment is checked for plagiarism by software called Turnitin linked to the E-Learn LMS. The software is intended to provide one more tool to improve the quality of academic writing and as such will be compulsory for use. It is important to note that this is merely one of many tools available to you and that final decisions about the quality of your work rest with your lecturer.

Assignment Submission: How to Use E-Learn LMS for Assignment Submission

1. You will be enrolled by the School of Diploma Studies Programme Management into the E-Learn LMS system only after your fee payment is confirmed.
2. You will be sent your USER NAME and PASSWORD via email.
3. Reset your password as prompted.
4. Enter the site at the following address: https://elearn-diploma.kaplan.com.sg
5. To submit assignment please refer to the LMS Manual

Please refer to your Student Handbook for more details on Penalties for Plagiarism, Misconduct, Examinations Rules and Regulations. Should you have any queries, please contact [email protected]


Topic 1: Present Data

Statistics enables you to present data in an effective way. In addition, the analysis of statistical data could lead to making an informed business decision, which will enable you to add value to your future company because ‘if you can’t measure it, you can’t manage it.’

For example, to report students’ performance on the Statistics exam, we need to collate the exam scores from all tutors and present them using suitable numerical measures and graphical charts. In this way, the management can have a quick overview of the students’ performance in this Intake.

If the management suspects that students’ performance has gotten worse, we could do some relevant statistical analysis to verify this. For example, we could deduce whether students’ average exam score for this year is significantly lower than last year.

As we go through this module, you will find plenty of occasions in life and business which would require statistics. The more you delve into the world of statistics, the more exciting the journey will get!

Learning Outcomes

The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:

1. Explain basic terms in Statistics
2. Identify the types of variables
3. Select appropriate sampling and data collection methods
4. Present data in tables and charts


1.1 Basic Concepts in Statistics

In order to present data appropriately, it is important to understand some basic concepts in Statistics. The basic concepts being discussed will also give you a good overview of the module and lay down a good foundation for your navigation through the world of Statistics.

1.1.1 Two Broad Studies of Statistics

There are two broad studies of Statistics – Descriptive Statistics and Inferential
Statistics (Weiss, 2017).

Descriptive statistics provides simple summaries about the data collected and
about the preliminary observations. Such summaries may be either quantitative
(numerical measures) or visual (e.g. simple-to-understand graphs). These
summaries may either form the basis of the initial description of the data as part
of a more extensive statistical analysis, or they may be sufficient in and of
themselves for a particular investigation.

For example, the shooting percentage in basketball is a descriptive statistic that summarizes the performance of a player or a team. It is the number of shots made divided by the number of shots taken, expressed as a percentage. A player who shoots 33% is making approximately one shot in every three. We can also present a simple chart that shows the shooting percentage for a few players to achieve a visual impact when presenting the information.
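
The calculation can be sketched in a few lines of Python. This is only an illustration: the player names and shot counts are hypothetical, and the row of '#' characters is a rough stand-in for a chart.

```python
def shooting_percentage(shots_made: int, shots_taken: int) -> float:
    """Return shots made as a percentage of shots taken."""
    if shots_taken <= 0:
        raise ValueError("shots_taken must be positive")
    return 100 * shots_made / shots_taken


# Hypothetical season totals (shots made, shots taken) for three players.
players = {"Player A": (33, 100), "Player B": (45, 90), "Player C": (12, 48)}

for name, (made, taken) in players.items():
    pct = shooting_percentage(made, taken)
    # Crude text bar: one '#' for every 5 percentage points.
    print(f"{name:<9} {pct:5.1f}%  {'#' * round(pct / 5)}")
```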


Statistical inference is the process of drawing conclusions from data that are subject to random variation (e.g. observational errors). More broadly, the terms inferential statistics, statistical inference or statistical induction are used to describe systems of procedures that can be used to draw conclusions from datasets arising from systems affected by random variation. The type of inferential statistical procedure used depends upon the type of data collected as well as the distribution of the data. The procedures are usually used for estimation or for testing hypotheses.

For example, to deduce whether the average IQ score of Kaplan students is more than 110, we could randomly select a small group of students to take an IQ test. Based on their average IQ score, we can perform an appropriate statistical test to deduce whether the average IQ score of all Kaplan students is indeed more than 110. The details will be discussed in later topics.

Descriptive Vs Inferential Statistics
https://www.youtube.com/watch?v=L6hy1CY-OW4

Deeper Overview of Inferential Statistics
https://www.youtube.com/watch?v=RXehwUHcghE


1.1.2 Population and Sample

A population (Berenson, Levine, & Szabat, 2015) is any entire collection of people, animals, plants or things from which we can collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.

In order to make any generalization about a population, a smaller group from within the population, known as a sample, is often studied. It is desirable that the sample is representative of the population. For each population there are many possible samples. A sample statistic gives information about a corresponding population parameter.

It is important that we carefully and completely define the population and thereafter collect a sample that is representative of the population.

Example of Population: The collection of ALL Kaplan students
Example of Parameter: Average IQ score of all Kaplan students
Example of Sample: Our class
Example of Statistic: Average IQ score of our class
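
The difference between a parameter and a statistic can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the IQ scores are simulated, the population size of 5,000 and class size of 40 are invented, and the class is drawn at random (a real class need not be a random sample).

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: simulated IQ scores for 5,000 students.
population = [rng.gauss(100, 15) for _ in range(5000)]

# Hypothetical sample: one class of 40 students drawn from the population.
sample = rng.sample(population, 40)

parameter = mean(population)  # population parameter (rarely known in practice)
statistic = mean(sample)      # sample statistic (what we can actually compute)

print(f"Population mean (parameter): {parameter:.1f}")
print(f"Sample mean (statistic):     {statistic:.1f}")
```

The two values are usually close but not identical; a different sample would give a different statistic, while the parameter stays fixed.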

Population vs Sample
LMS Learning Outcome 1.1
https://elearn-diploma.kaplan.com.sg


Go through these questions and check your answers with the lecturer.

1. A population is a collection of all individuals, objects, or measurements of interest.
True / False

2. A sample is a portion or part of the population of interest.
True / False

3. Kaka Ltd has 9,000 workers. 360 staff members were polled regarding a new wage package to be submitted to management. The population is the 360 members.
True / False


1.1.3 Types of Variables

In statistics, we begin with the list of Variables we would like to analyze. The values that we collect for these variables are known as Data. A variable is a characteristic being observed that may assume more than one of a set of values.
Example of variable: Preferred colour of mobile phones
Example of data: black, white, red, blue, yellow

Variable/Data
    Qualitative (Categorical)
    Quantitative (Numerical)
        Discrete
        Continuous

Variables can be further divided into Qualitative (or Categorical) and Quantitative (or Numerical) types (Australian Bureau of Statistics, n.d.).

Example of qualitative data: Colour, gender, race
Example of quantitative data: Number of customers, height, price

Quantitative variables can be further divided into Discrete and Continuous types (Stephanie, 2018).

Example of discrete data: Number of students, number of rainy days
Example of continuous data: Height, price, productivity growth

Types of Variables/Data
LMS Learning Outcome 1.2
https://elearn-diploma.kaplan.com.sg


Discuss with your classmates the type of variable for each of the following. A good way to work on this is to imagine what kind of data you would be collecting for each variable.

Indicate the type of each variable:

1. Modules for this term

2. Number of modules for this term

3. Fee paid for this term

4. Age of students

5. Number of units shipped

6. Unit price

7. GDP Growth

8. Customer Service Rating

9. Revenue

10. Housing Types


1.2 Data Collection

In this section, we will discuss the issues related to data collection. In particular,
we will discuss the ‘Sources of Data’ and ‘Sampling Methods’.

1.2.1 Sources of Data


There are two sources of data (University of Rochester, 2018): Primary and Secondary sources.

Primary data are those collected by the investigator conducting the research. These are original materials on which research is based. There are many ways to obtain such data, such as experiments, surveys and observation.

Secondary data are data that were collected by someone other than the researcher. Secondary sources offer interpretation or analysis based on primary sources. They may explain primary sources and often use them to support a specific argument or persuade the reader to accept a certain point of view.


The following are examples showing how you could collect primary data:

• Experiment
For a batch of 50,000 newly manufactured mobile phones, the Factory Manager can collect a random sample of 100 phones, fully charge each phone and determine the average stand-by time.

• Survey
To determine whether Kaplan students are happy with the campus facilities, we can conduct a survey of 200 students.

• Observation
To decide whether the duration of 2 hours for the Statistics exam is sufficient, the lecturer can observe the time taken by a group of students during the next Statistics exam.


1.2.2 Sampling Methods


Sampling methods are used to select a sample from within a general population. Proper sampling methods are important for eliminating bias in the selection process (Barratt, 2009).

For example, if Kaplan wants to conduct a survey involving 100 of its students, it would have to ensure that the sample of 100 students represents all Kaplan students. If Kaplan surveys only local students, the sample is biased, since it does not reflect the actual distribution of Kaplan students by country.

Common methods of sampling include simple random sampling, systematic sampling, stratified sampling and cluster sampling.

A simple random sample is a subset of individuals chosen from the population. Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process. Simple random sampling is commonly used. This method is similar to the lucky draw process. Usually, we do this through statistical software.
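
As a minimal sketch of how software might perform such a draw, the Python standard library's random.sample selects distinct members with equal probability. The student IDs and the sample size of 100 are hypothetical.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, n)


student_ids = list(range(1, 1001))  # hypothetical student IDs 1..1000
chosen = simple_random_sample(student_ids, 100, seed=1)

print(len(chosen), "students selected, all distinct:", len(set(chosen)) == len(chosen))
```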

The most common form of systematic sampling is an equal-probability method. In this approach, progression through the list is treated circularly, with a return to the top once the end of the list is passed. The sampling starts by selecting an element from the list at random and then every kth element in the frame is selected, where k, the sampling interval, is calculated as follows:

k = N / n

where n is the sample size, and N is the population size.

Note that we would need to serialize the population (list it in a numbered sequence) for this process to work.
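
The procedure can be sketched as follows, assuming the usual sampling interval k = N / n (rounded down here when N is not a multiple of n) and a circular walk through the list. The frame of 500 serial numbers is hypothetical.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Pick a random start, then take every k-th element, wrapping around the list."""
    N = len(frame)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and the population size")
    k = N // n  # sampling interval, rounded down if N is not a multiple of n
    start = random.Random(seed).randrange(N)  # random starting position
    return [frame[(start + i * k) % N] for i in range(n)]


frame = list(range(1, 501))  # hypothetical serialized population, N = 500
picked = systematic_sample(frame, 50, seed=2)  # n = 50, so k = 10

print(picked[:5])
```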


In Stratified sampling, we divide the entire target population into different subgroups, or strata, and then randomly select individuals from each stratum in proportion to its size. Usually, we stratify the population by one or more characteristics (e.g. gender, age). This method is more resource intensive, though. In a population census, we usually use this method to obtain a sample.
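
A minimal sketch of proportional stratified sampling is shown below. The population of 600 local and 400 international students is hypothetical, and the helper name stratified_sample is introduced only for this illustration.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, n, seed=None):
    """Sample from each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for members in strata.values():
        # Proportional allocation; rounding can make the total differ slightly from n.
        quota = round(n * len(members) / len(units))
        sample.extend(rng.sample(members, quota))
    return sample


# Hypothetical population: 600 local and 400 international students.
students = [("local", i) for i in range(600)] + [("international", i) for i in range(400)]
picked = stratified_sample(students, lambda s: s[0], n=100, seed=7)

print(sum(s[0] == "local" for s in picked), "local,",
      sum(s[0] == "international" for s in picked), "international")
```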

In Cluster sampling (a method closely related to multistage sampling), we divide the population into some convenient groups. For example, we could cluster Kaplan students into classes (say 50 per class). We can then form a sample by randomly selecting a number of clusters and including all individuals in these clusters as part of the sample. For example, if we want a sample of 500 Kaplan students, we can randomly pick 10 classes to form a sample of 500 students.
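
The class example can be sketched in Python as follows. The 40 classes of 50 students and their labels are hypothetical; the point is that whole clusters are selected at random and every member of a selected cluster enters the sample.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every member of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


# Hypothetical population: 40 classes of 50 students each.
classes = {
    f"class_{c:02d}": [f"class_{c:02d}_student_{s:02d}" for s in range(50)]
    for c in range(40)
}
sample = cluster_sample(classes, 10, seed=3)

print(len(sample), "students from 10 randomly chosen classes")
```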

Sampling Methods
https://www.youtube.com/watch?v=pTuj57uXWlk


1.3 Present Data in Tables and Charts

In this section, we shall discuss the various methods to present data in tables
and charts.

You will discover that it is very important to differentiate the types of data so that
we could present the information using appropriate tools. Therefore, before you
begin this section, please do a quick review on the difference between qualitative
and quantitative data.

1.3.1 Present Qualitative Data

We will discuss the tools that are commonly used to present qualitative data (Deborah, 2016). Recall that qualitative data are also known as categorical data. Here are some examples of qualitative variables:

• Your Country
• Types of Diploma
• Favourite Colour

In particular, we will be learning how to tabulate qualitative data into a Summary Table and present it in Bar Charts and Pie Charts.


Summary Table

A table is a systematic arrangement of statistical data in rows and columns. Rows are horizontal arrangements whereas columns are vertical arrangements.

Race in Singapore Percentage


Chinese 74%
Malays 14%
Indians 9%
Others 3%

For qualitative data, we will create a Summary Table in which the first column is headed by the name of the variable (e.g. Race in Singapore for the above table) and lists its categories. We then create another column indicating the percentage for each category. We could also create another column showing the frequency for each category, if necessary. Frequency is defined as the number of occurrences of that category.

The following is an example of how we construct the above Summary Table:

A sample of 1,000 Singaporeans was interviewed. We noted that there were 740 Chinese, 140 Malays, 90 Indians and 30 of other races. As such, we can compute the following for the Chinese category:

Frequency of Chinese = 740

Percentage of Chinese = (740 / 1,000) × 100% = 74%

We can similarly work out the percentages for the rest of the categories and display them as in the above table.
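
The same tabulation can be sketched in Python. The list of responses is rebuilt from the counts given in the example; the variable names are chosen for this illustration only.

```python
from collections import Counter

# Rebuild the 1,000 interview responses from the counts given in the example.
responses = ["Chinese"] * 740 + ["Malays"] * 140 + ["Indians"] * 90 + ["Others"] * 30

frequency = Counter(responses)  # number of occurrences of each category
total = sum(frequency.values())
percentage = {race: 100 * count / total for race, count in frequency.items()}

print(f"{'Race in Singapore':<20}{'Frequency':>10}{'Percentage':>12}")
for race, count in frequency.most_common():
    print(f"{race:<20}{count:>10}{percentage[race]:>11.0f}%")
```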


Bar Chart

A Bar Chart is a common tool used to present qualitative data. It is a chart with rectangular bars whose lengths are proportional to the values that they represent. The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes called a column chart.

[Figure: horizontal bar chart titled "Race in Singapore". Vertical axis: Chinese, Malays, Indians, Others. Horizontal axis: percentage, from 0% to 80%.]

One axis of the chart shows the specific categories being compared, and the other axis represents the percentage or frequency. In this example, you will notice that we have used the vertical axis to show the races (Chinese, Malays, Indians & Others). On the horizontal axis, we have shown the percentage.

In practice, it is preferred to show percentage rather than frequency, because the percentage of each category lets the viewer compare the size of that category with the overall data collected.


Pie Chart

A Pie Chart is another commonly used tool to present qualitative data. Pie charts can be an effective way of displaying information in some cases, in particular if the intent is to compare the size of a slice with the whole pie, rather than to compare the slices with one another.

[Figure: pie chart titled "Race in Singapore". Chinese 74%, Malays 14%, Indians 9%, Others 3%.]

A pie chart is a circular chart divided into sectors, illustrating proportion. In a pie chart, the size of each sector (and consequently its angle) is proportional to the quantity it represents. The angle of a whole circle is 360° and it represents 100%. As such, an angle of 3.6° represents 1% of the whole. Therefore, to represent the 74% who are Chinese, we would need an angle of 74 × 3.6 = 266.4°.
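
As a quick check of the arithmetic, the sector angle for every category can be computed the same way. The sketch below uses the percentages from the Summary Table; the variable names are chosen for this illustration only.

```python
# Percentages taken from the "Race in Singapore" Summary Table.
shares = {"Chinese": 74, "Malays": 14, "Indians": 9, "Others": 3}

DEGREES_PER_PERCENT = 360 / 100  # 3.6 degrees represent 1%

# Sector angle for each category, rounded to one decimal place.
angles = {race: round(pct * DEGREES_PER_PERCENT, 1) for race, pct in shares.items()}

for race, angle in angles.items():
    print(f"{race:<8} {shares[race]:>3}%  {angle:>6.1f} degrees")
```

The angles add up to 360 degrees, just as the percentages add up to 100%.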

Present Qualitative Data
https://www.youtube.com/watch?v=tPjKTkepLqE


1.3.2 Present Quantitative Data

We have seen how we could present qualitative data in Summary Tables, Bar
Charts and Pie Charts. We shall now discuss how we could present
quantitative data.

Recall that quantitative data is numerical data. The following are some examples of quantitative data:

• Number of Modules in this Term
• Number of Students in the QA Classes
• School Fee for Diploma Courses

We shall learn to present quantitative data in tables and charts (Deborah, 2016). In particular, we will be learning how to construct

• Frequency Distribution Table
• Cumulative Frequency Distribution Table
• Histogram
• Cumulative Percent Curve (or Ogive)


Example: A manufacturer of insulation randomly selected 20 winter days and
recorded the daily high temperature (in ºF):

24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

It is always wise to arrange the data in increasing order before proceeding:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Frequency Distribution Table

A frequency distribution table shows the different measurement categories,
together with the number and percentage of observations in each category. Before
constructing a frequency distribution table, you should have an idea of the range
(minimum and maximum) of the data. The range is divided into arbitrary intervals
called “class intervals”. If there are too many class intervals, the data are hardly
condensed and minor deviations become noticeable. On the other hand, if there
are too few class intervals, the distribution of the data cannot be clearly shown.
Generally, 5 to 12 intervals are adequate.

In this example, the data range from 12 to 58. As such, we could use five class
intervals of width 10, as shown. Each class includes its lower limit but not its
upper limit (for example, 30 is counted in “30 to 40”):

Temp (ºF)    Frequency    Percentage (%)
10 to 20         3              15
20 to 30         6              30
30 to 40         5              25
40 to 50         4              20
50 to 60         2              10
TOTAL           20             100
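The counting behind the table can be sketched as follows. This is a minimal illustration rather than a prescribed method: the function name `frequency_table` and its parameters are hypothetical, and it assumes each class includes its lower limit but excludes its upper limit, which is what the frequencies in the table imply.

```python
# Daily high temperatures (in ºF) from the example.
temperatures = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
                32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def frequency_table(data, start, width, n_classes):
    """Count the observations in each class [lower, upper)."""
    rows = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width
        freq = sum(1 for x in data if lower <= x < upper)
        pct = 100 * freq / len(data)
        rows.append((lower, upper, freq, pct))
    return rows


table = frequency_table(temperatures, start=10, width=10, n_classes=5)
for lower, upper, freq, pct in table:
    print(f"{lower} to {upper}: frequency {freq}, percentage {pct:.0f}%")
```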


Cumulative Frequency Distribution Table

A table showing the cumulative frequencies and percentages is called a
Cumulative Frequency Distribution Table. This table is an extension of the
frequency distribution table. We need to create two additional columns –
Cumulative Frequency and Cumulative Percentage.

The cumulative frequency of a class interval is the sum of the frequency of that
class and the frequencies of all the classes below it. For example, for the class
interval “30 to 40”, the cumulative frequency is 3 + 6 + 5 = 14. Similarly, we can
work out the cumulative frequencies of the other classes as well as the cumulative
percentages.

Temp (ºF)    Frequency    Percentage (%)    Cumulative    Cumulative
                                            Frequency     Percentage (%)
10 to 20         3              15               3              15
20 to 30         6              30               9              45
30 to 40         5              25              14              70
40 to 50         4              20              18              90
50 to 60         2              10              20             100
TOTAL           20             100
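The two extra columns are running totals, so they can be sketched with a running sum. The snippet below is only an illustration; it uses `itertools.accumulate` from the Python standard library and the frequencies from the table, and the variable names are chosen for this sketch.

```python
from itertools import accumulate

classes = ["10 to 20", "20 to 30", "30 to 40", "40 to 50", "50 to 60"]
frequencies = [3, 6, 5, 4, 2]

total = sum(frequencies)
# Running total of the frequencies, class by class.
cum_freq = list(accumulate(frequencies))
# Each cumulative frequency expressed as a percentage of the total.
cum_pct = [100 * c / total for c in cum_freq]

for label, cf, cp in zip(classes, cum_freq, cum_pct):
    print(f"{label}: cumulative frequency {cf}, cumulative percentage {cp:.0f}%")
```

The last cumulative frequency should always equal the total number of observations, and the last cumulative percentage should be 100.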

Cumulative Frequency Distribution Table
https://www.youtube.com/watch?v=5dKpNRSDKr8


Histogram

A Histogram is a graphical representation that gives a visual impression of the
distribution of data. It displays the frequencies from the frequency distribution
table as adjacent rectangles erected over the class intervals, each with an area
proportional to the frequency of the observations in the interval. In most cases,
we use equal class intervals, so the height of each rectangle can be used to
represent the frequency instead of the area. Besides ‘frequency’, we could also
use ‘percentage’ for the vertical axis.

You may note that histograms are very similar to bar charts. The differences are:
• Bar charts are for categorical data while histograms are for numerical data
• Bar charts have gaps between the bars while histograms do not have gaps
• Bar charts can be displayed vertically or horizontally while histograms are
usually displayed vertically

Histogram
https://www.youtube.com/watch?v=YLPDPglvePY


Ogive

An Ogive, also known as a Cumulative Percent Curve, is a graph of the cumulative
distribution of the data. We plot the cumulative percentage (vertical axis) against
the upper class boundaries, i.e. the upper limit of each class (horizontal axis).
Note that when joining up the points, we want to create as smooth an ‘S’ curve as
possible. When reading the Ogive, for example, the point at 40 ºF and 70% means
that on 70% of the days the daily high temperature is less than 40 ºF.

Histogram and Ogive with Excel
https://www.youtube.com/watch?v=x8ePdM9LquM


The following table shows the daily high temperature (in ºF) of a village in China.
Construct a Histogram and Ogive using the following frequency distribution:

Temperature (ºF)    Number of Days
10 to 20                  1
20 to 30                  3
30 to 40                  5
40 to 50                  4
50 to 60                  2
TOTAL                    15


Summary

Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.

• Descriptive & Inferential Statistics
• Population vs Sample
• Types of Variable
• Source of Data & Sampling Methods
• Present Qualitative (categorical) Data
• Present Quantitative (numerical) Data


REFERENCES

Australian Bureau of Statistics. (n.d.). Statistical Language - Quantitative and
Qualitative Data [webpage]. Retrieved from
http://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+quantitative+and+qualitative+data

Barratt. (2009). Methods of Sampling from a Population. Retrieved from
https://www.healthknowledge.org.uk/public-health-textbook/research-methods/1a-epidemiology/methods-of-sampling-population

Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.

Deborah, J. (2016). Statistics For Dummies. New York, United States: John
Wiley & Sons Inc.

Stephanie. (2018). StatisticsHowto.com [webpage]. Retrieved from
http://www.statisticshowto.com/discrete-vs-continuous-variables/

University of Rochester. (2018). Primary and Secondary Sources. Retrieved from
https://www.library.rochester.edu/Primary-secondary%20sources

Weiss, N. (2017). Introductory Statistics. England: Pearson Education Ltd.


Topic 2: Numerical Measures

In this topic, we shall discuss the various methods to summarise numerical data
into useful measurements. Firstly, we will introduce the three measurements of
central tendency – Mean, Median and Mode. Thereafter, we will learn how to
measure variation of data using both Standard Deviation and Interquartile
Range. Finally, we will also introduce some important symbols for parameters
and statistics. To do all these computations, you will need a non-programmable
scientific calculator.

Before we begin, recall that a population consists of all the items or individuals
about which you want to draw a conclusion. We usually do not have the
population. As such, we collect a sample which is a subset of the population for
analysis. Therefore, in this topic, we shall emphasize the measurements for a
sample. We will briefly discuss the measurements of the population at the end of
the topic.

Learning Outcomes

The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:

1. Compute mean, median and mode
2. Compute Standard Deviation and Interquartile Range
3. Distinguish the symbols commonly used for parameters and statistics


2.1 Measures of Central Tendency

We shall begin with the Measures of Central Tendency for sample data.
Central tendency relates to the way in which quantitative data tend to cluster
around some value. A measure of central tendency is any of a number of ways of
specifying this "central value". In practical statistical analysis, the term is often
used before we have chosen even a preliminary form of analysis. Hence, an
initial objective might be to choose an appropriate measure of central tendency.
In this section, we shall look at three measures of central tendency – Mean,
Median and Mode (Berenson, 2015).

2.1.1 Sample Mean

Sample Mean is just the average from a set of data.

Sample mean is the most common measure of central tendency and is important
in performing statistical analysis. For example, sample mean could be used to
determine the average score of the QA quiz for our class.

To compute the sample mean, we need to add up all the data and divide by the
total number of observations.

Example: 3, 5, 6, 8, 9

$$\bar{X} = \frac{3 + 5 + 6 + 8 + 9}{5} = \frac{31}{5} = 6.2$$
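For readers who want to check the arithmetic on a computer, the same calculation can be sketched in Python. The variable names are illustrative only; `statistics.mean` is part of the Python standard library and is shown as a cross-check.

```python
from statistics import mean

sample = [3, 5, 6, 8, 9]

n = len(sample)            # sample size
x_bar = sum(sample) / n    # add up all the data, then divide by n

print(x_bar)               # manual computation
print(mean(sample))        # library cross-check
```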


Sample mean is easily affected by extreme values, also known as outliers
(Manikandan, 2011). An extreme value is a number in the data set that is far from
the rest of the data. Therefore, in situations where the data contain extreme
values, sample mean may not be a suitable measure of central tendency.

Besides learning the computation of sample mean, do note the symbol $\bar{X}$
(pronounced as x-bar), which represents the sample mean, and the symbol n, which
represents the sample size (i.e. the number of observations in the sample).

Sample Mean
https://www.youtube.com/watch?v=lBlnjzHVUYU

2.1.2 Sample Median

Sample Median is the numerical value separating the higher half of a sample
from the lower half. The sample median can be found by arranging all the
observations from lowest value to highest value and picking the middle one.

Example A: 3, 5, 6, 8, 9

Here the median is 6, the middle (3rd) value of the five ordered observations.

If there is an even number of observations, then there is no single middle value;
the median is then usually defined to be the average of the two middle values.


Example B: 3, 5, 6, 8, 9000

Comparing Examples A and B, we note that the only difference is that the number 9
in Example A is replaced by 9000 in Example B. Nevertheless, the sample median
remains 6. Therefore, unlike sample mean, sample median is not affected by
extreme values (Rehill, n.d.).

If we have a large sample, it may be more efficient to work out the median
position of the ordered data to locate the middle value. The formula for the
median position is as follows, where n is the sample size discussed earlier:

$$\text{Median position} = \frac{n + 1}{2} \text{ of the ordered data}$$

• If the sample size is odd, the median is the middle number
• If the sample size is even, the median is the average of the two middle
numbers


Example:

0 0 5 7 8 9 12 14 22 33

The middle two values are 8 and 9.

Median Position = (10 + 1) / 2 = 5.5

Therefore, Median = (8 + 9) / 2 = 8.5

Referring to the above example, we have a sample of 10 observations (i.e. n = 10).
Firstly, we need to arrange the data from smallest to biggest. Thereafter, we use
the formula provided to determine the median position as 5.5.

Since 5.5 is between 5 and 6, we need to take the average of the 5th value (i.e. 8)
and the 6th value (i.e. 9). Finally, we can conclude that the median is 8.5
(average of 8 and 9).

Note that median position is not the median; it is just the location of the median.
As such, when you compute the sample median, do use the presentation
provided above as your example. In addition, always remember to arrange your
data first when you compute the sample median.
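The median-position procedure can be sketched as a short Python function. This is an illustration only: the function name `median_by_position` is hypothetical, and it returns both the position and the median so that the two are not confused.

```python
from statistics import median


def median_by_position(data):
    """Return (median position, median) using the (n + 1) / 2 rule."""
    ordered = sorted(data)             # always arrange the data first
    n = len(ordered)
    position = (n + 1) / 2             # 1-based position in the ordered data
    if position.is_integer():          # odd n: a single middle value
        return position, ordered[int(position) - 1]
    lower = ordered[int(position) - 1]  # value just below the position
    upper = ordered[int(position)]      # value just above the position
    return position, (lower + upper) / 2


print(median_by_position([0, 0, 5, 7, 8, 9, 12, 14, 22, 33]))
print(median_by_position([3, 5, 6, 8, 9]))      # Example A
print(median_by_position([3, 5, 6, 8, 9000]))   # Example B
```

Examples A and B give the same median even though their largest values differ greatly, which illustrates why the median is not affected by extreme values.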

Sample Median
https://www.youtube.com/watch?v=SMwRMkvxik0


2.1.3 Sample Mode

Sample Mode is the value that occurs most frequently in the sample.

Example A:

In Example A, the mode is 9 since this number appears the greatest number of
times.

Example B:

In Example B, since all the numbers appear the same number of times, we
conclude that the sample has no mode.

Example C:

In Example C, the numbers 1 and 5 appear most frequently and the same number
of times. As such, we conclude that the modes are 1 and 5.
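The three cases (one mode, no mode, several modes) can be sketched in Python. Note that the data sets below are hypothetical and chosen only to reproduce the three situations; they are not the data from Examples A to C. The function name `sample_modes` is also made up for this sketch.

```python
from collections import Counter


def sample_modes(data):
    """Return a sorted list of modes; an empty list means 'no mode'."""
    counts = Counter(data)
    highest = max(counts.values())
    # If every distinct value occurs equally often, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)


# Hypothetical samples, for illustration only.
print(sample_modes([3, 5, 9, 9, 12]))      # one mode
print(sample_modes([2, 4, 6, 8]))          # no mode
print(sample_modes([1, 1, 3, 5, 5, 7]))    # two modes
```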

Measures of Central Tendency
LMS Learning Outcome 2.1
https://elearn-diploma.kaplan.com.sg


2.1.4 Shape of Distribution

In statistics, the concept of the Shape of the Distribution refers to the shape of
a probability distribution and it most often arises in questions of finding an
appropriate distribution to model the statistical properties of a population when
given a sample (MathBits, 2018).

We will learn the concept of probability distribution in subsequent topics. For now,
it suffices to be able to identify the shape of the distribution of a sample through a
histogram.

As we will discover in later topics, the most important shape of distribution is the
Bell Curve. This is a symmetrical distribution. In this case, the mean is equal to
the median.

When the mean is less than the median, we regard this as a left-skewed
distribution, as shown in the left-hand picture above.

On the other hand, when the mean is more than the median, we have a
right-skewed distribution, as shown in the right-hand picture above.
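The mean-versus-median comparison can be sketched as a small helper. This is a rough illustration of the rule of thumb described above, not a formal test of skewness; the function name `skew_direction` is hypothetical, and the second and third data sets are made up for this sketch.

```python
from statistics import mean, median


def skew_direction(data):
    """Compare the mean with the median, following the rule of thumb above."""
    m, med = mean(data), median(data)
    if m < med:
        return "left-skewed"
    if m > med:
        return "right-skewed"
    return "mean equals median (consistent with a symmetrical shape)"


print(skew_direction([3, 5, 6, 8, 9000]))   # Example B from the median section
print(skew_direction([1, 8, 9, 10]))        # hypothetical data
print(skew_direction([1, 2, 3, 4, 5]))      # hypothetical data
```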

Shape of Distribution
https://www.youtube.com/watch?v=A_8cfQJeqjs


Find the mean, median and mode of the following data:

$440, $490, $550, $390, $280, $390

Is it left- or right-skewed?


2.2 Measures of Variation

The measures of central tendency give us a feel for the centre of the data.
However, these measurements are not sufficient as we do not have a good feel
as to whether the data are close together or widely spread out.

To achieve this, we need another set of measurements known as Measures of
Variation. We will be learning two such measures in this topic – Standard
Deviation and Interquartile Range. These measures give information on the
spread or dispersion of the data. A small value indicates that the data are
close to each other, while a very big value tells us that the data are far apart
from each other.

2.2.1 Standard Deviation

We shall now take a look at the first measure of variation – Sample Standard
Deviation. To obtain the standard deviation, we need to compute the Sample
Variance then take the square-root of this value (ThoughtCo, n.d.).

$$\text{Sample Standard Deviation, } S = \sqrt{\text{Variance}} = \sqrt{S^2}$$

The following is the formula for sample variance:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$


The formula may seem challenging at a first glance but it is quite easy to apply.
Take a look at the following example and you will be more comfortable with it.

Example: 10 12 14 15 17 18 18 25

The standard deviation will have the same unit of measurement as the data. For
example, if the data refers to the heights of students (in metres) then the
standard deviation will also have the same unit – metres.

These are the steps to compute the standard deviation:

Step 1: Compute the sample mean (16.1 for this example)

Step 2: Compute the sample variance (20.98 for this example)

Step 3: Compute the sample standard deviation by taking the square root of
variance (4.6 for this example)
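The three steps can be sketched in Python using the example data. The variable names are illustrative; `statistics.variance` and `statistics.stdev` from the standard library use the same n - 1 divisor and are shown only as a cross-check.

```python
from math import sqrt
from statistics import stdev, variance

sample = [10, 12, 14, 15, 17, 18, 18, 25]
n = len(sample)

# Step 1: sample mean
x_bar = sum(sample) / n

# Step 2: sample variance (sum of squared deviations divided by n - 1)
squared_deviations = [(x - x_bar) ** 2 for x in sample]
s2 = sum(squared_deviations) / (n - 1)

# Step 3: sample standard deviation (square root of the variance)
s = sqrt(s2)

print(round(x_bar, 1), round(s2, 2), round(s, 1))
print(round(variance(sample), 2), round(stdev(sample), 1))  # cross-check
```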

Do also pay attention to the symbols used for sample variance ($S^2$) and sample
standard deviation ($S$). These symbols will come in handy when we deal with
later topics.

Standard Deviation
LMS Learning Outcome 2.2
https://elearn-diploma.kaplan.com.sg


7, 9, 10, 11, 13, 17

Find the
• Mean
• Variance
• Standard Deviation


Comparing Standard Deviations

We mentioned that standard deviation is a common measurement of the
variation of data. We shall now take a look at how we could interpret the
standard deviation.

Standard deviation shows how much variation or dispersion exists from the mean.
Very loosely, we can use standard deviation as an indication of the “average gap”
between the data and the sample mean. A low standard deviation indicates that
the data points tend to be very close to the mean; high standard deviation
indicates that the data points are spread out over a large range of values.

Take a look at the above three examples. All three examples have the same
sample mean (15.5). By looking at the mean alone, we have no clue about the
variation in the three samples. If we examine the sample standard deviations, we
notice that Sample B has the smallest value, which indicates that its data are
close to the sample mean. Conversely, Sample C has a big standard deviation,
which indicates that its data are far away from the sample mean.


Coefficient of Variation

We have learned how to compute the standard deviation and use it to indicate the
variation of data. However, at times, it is not easy to compare the spread of data
between samples. This is particularly so when the means and standard
deviations of the samples have different values.

In such cases, we need to derive the Coefficient of Variation from the sample
mean and standard deviation as shown in the following formula:

$$CV = \left( \frac{S}{\bar{X}} \right) \times 100\%$$

The coefficient of variation is the ratio of the standard deviation to the
mean, expressed as a percentage. It is a useful statistic for comparing the degree
of variation from one sample to another, even if the means are drastically
different from each other.

In the above examples, both have the same standard deviation ($5) but their
means differ. When we compute the coefficient of variation, we note that Sample
B has a much higher value (50%). Therefore, we conclude that Sample B has a
bigger variation although both have the same sample standard deviation.
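The formula can be sketched as a one-line function. This is an illustration only: the function name `coefficient_of_variation` is hypothetical. The first call reuses the data from the standard deviation example; the second uses Sample B's standard deviation of $5 with a mean of $10, a value that is not stated in the text but is implied by the 50% figure.

```python
from statistics import mean, stdev


def coefficient_of_variation(s, x_bar):
    """Standard deviation as a percentage of the mean."""
    return s / x_bar * 100


# Data from the standard deviation example in Section 2.2.1.
sample = [10, 12, 14, 15, 17, 18, 18, 25]
cv_example = coefficient_of_variation(stdev(sample), mean(sample))

# Sample B: S = $5; a mean of $10 is assumed here, as implied by CV = 50%.
cv_sample_b = coefficient_of_variation(5, 10)

print(round(cv_example, 1), cv_sample_b)
```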


7, 9, 10, 11, 13, 17

Continuing from the previous Activity, find the Coefficient of Variation.


2.2.2 Interquartile Range

The next measure of variation is the interquartile range. To compute this statistic,
we first need to learn the concept of Quartiles. The Quartiles of a sample
are the three points (indicated by arrows in the following diagram) that divide
the ordered data set into four equal groups, each containing a quarter of the
observations (Weiss, 2017).

The first quartile (Q1) is the value below which 25% of the observations fall.

The second quartile (Q2) is the value below which 50% of the observations
fall. You may also realize that the second quartile is actually the median.

The third quartile (Q3) is the value below which 75% of the observations fall.

Therefore, there are three quartiles in a sample.

Q1 position = (n + 1)/4
Q2 position = 2(n + 1)/4
Q3 position = 3(n + 1)/4

We can use the above formulas to work out the Quartile Positions in the ordered
data. Recall that n is the symbol for the sample size.


When we divide a number by 4, we may not obtain a whole number. As such, we
need to introduce some Rounding Rules as follows:

• If the result is a whole number, then it is the ranked position
• If the result is a fractional half (e.g. 2.5, 7.5, 8.5, etc.), then
average the two corresponding data values
• If the result is neither a whole number nor a fractional half, then
round the result to the nearest integer

The following is an example to illustrate how to apply the rounding rules:

If n = 11, then Q1 position = (11 + 1)/4 = 3. Hence, we take the 3rd value as Q1.

If n = 12, then Q1 position = (12 + 1)/4 = 3.25 ≈ 3. Hence, we take the 3rd value
as Q1.

If n = 14, then Q1 position = (14 + 1)/4 = 3.75 ≈ 4. Hence, we take the 4th value
as Q1.

If n = 13, then Q1 position = (13 + 1)/4 = 3.5. In this case, just like the way we
deal with median, Q1 will be the average of the 3rd and 4th values.

As a general rule, if a quartile position is a fractional half (e.g. 2.5, 7.5, 8.5),
we need to average the two adjacent values to obtain the respective
quartile. Otherwise, we just round off the position to the nearest whole number to
locate the respective quartile.


The following example illustrates how we compute the quartile position and the
respective quartiles. Do note that we need to arrange our data before we
proceed with the computation. You may want to give some attention to the
presentation of the workings too. Do not mix up quartile positions (which are the
locations of the quartiles) and the quartiles.

Example: 11, 12, 13, 16, 16, 17, 18, 21, 22

Q1 position = (9+1)/4 = 2.5 so Q1 = (12+13)/2 = 12.5

Q2 position = 2(9+1)/4 = 5 so Q2 = 16

Q3 position = 3(9+1)/4 = 7.5 so Q3 = (18+21)/2 = 19.5

The Interquartile Range (IQR) is defined as the difference between Q3 and Q1.

IQR = Q3 – Q1

IQR will have the same unit as the data. Note that the range from Q1 to Q3
contains the middle 50% of the data.

If the IQR is small, we can conclude that the middle 50% of the data are close to
the median. Hence, the variation of the data is small.

If the IQR is very large, we can conclude that the middle 50% of the data are far
apart. In this case, the data at the two ends will be even further apart. Hence, the
variation of the data will be large.

Example:

Continue from previous example,

IQR = Q3 – Q1 = 19.5 – 12.5 = 7
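The quartile positions, the rounding rules and the IQR can be put together in a short Python sketch. This is one possible illustration under the rules stated in this guide (other textbooks and software use different quartile conventions); the function names are hypothetical, and the sketch assumes the sample has at least two observations.

```python
def value_at_position(ordered, position):
    """Apply the rounding rules to a 1-based quartile position."""
    whole = int(position)
    if position == whole:                 # whole number: take that rank
        return ordered[whole - 1]
    if position - whole == 0.5:           # fractional half: average neighbours
        return (ordered[whole - 1] + ordered[whole]) / 2
    nearest = int(position + 0.5)         # otherwise: round to nearest integer
    return ordered[nearest - 1]


def quartiles(data):
    """Return (Q1, Q2, Q3) using positions k(n + 1)/4 for k = 1, 2, 3."""
    ordered = sorted(data)                # arrange the data first
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


q1, q2, q3 = quartiles([11, 12, 13, 16, 16, 17, 18, 21, 22])
iqr = q3 - q1
print(q1, q2, q3, iqr)
```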

Quartiles and Inter-Quartile Range
LMS Learning Outcome 2.3
https://elearn-diploma.kaplan.com.sg


$44, $49, $55, $39, $28, $39, $38, $56, $59, $64

a) Find Q1, Q2 and Q3

b) Find the IQR


2.3 Numerical Measures for a Population

So far, we have learned three measures of central tendency (mean, median and
mode) and two measures of variation (standard deviation and interquartile range).

We mentioned earlier that we generally work with samples instead of the
population. As such, all the calculations so far deal with samples. We shall
now have a quick look at some Numerical Measures for a Population
(Pennsylvania State University, 2018).

The first population parameter we are looking at is the Population Mean:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

where

μ = population mean (pronounced “mu”)

N = population size

This is just the average of all the values in the population. In usual practice, we
do not have the population and hence there is not much opportunity to apply the
formula to compute the population mean. Nevertheless, it is important to note the
symbol μ, which represents the population mean.

The formula for the Population Variance is:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

The computation is very similar to that of the sample variance, except that we
divide by N rather than n - 1. As with the population mean, it is more
important to be familiar with the symbol σ² (read as sigma-squared).

Similar to the computation for a sample, we can determine the Population
Standard Deviation by taking the square root of the variance. Its symbol is σ
(pronounced “sigma”).
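The difference between the sample and population versions can be seen by computing both on the same numbers. The sketch below is only an illustration: it reuses the data from Section 2.2.1 and treats them, purely as an assumption for this example, as if they were an entire population. It assumes the standard definition in which the population variance divides by N, which is what `statistics.pvariance` and `statistics.pstdev` implement.

```python
from statistics import pstdev, pvariance, variance

# Assumption for illustration: these 8 values form the whole population.
values = [10, 12, 14, 15, 17, 18, 18, 25]
N = len(values)

mu = sum(values) / N                                  # population mean
sigma2 = sum((x - mu) ** 2 for x in values) / N       # divide by N
sigma = sigma2 ** 0.5                                 # population std deviation

print(mu, sigma2, round(sigma, 2))
print(pvariance(values), round(pstdev(values), 2))    # library cross-check
print(round(variance(values), 2))                     # sample version (n - 1)
```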


The following table shows the key symbols which will be used frequently in the
discussion of subsequent topics:

You are strongly encouraged to commit these symbols into memory. You will
soon find out that it is very much easier to understand the concepts and interpret
the questions when you are very familiar with these symbols.


Summary

Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.

• Measurement of Central Tendency
- Mean, Median and Mode
- Shape of Distribution

• Measurement of Variation
- Standard Deviation, Coefficient of Variation
- Quartiles, IQR

• Symbols for Parameters and Statistics


REFERENCES

Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.

Manikandan, S. (2011). Measures of central tendency: The mean. Journal of
Pharmacology & Pharmacotherapeutics, 2(2), pp. 140-142.

MathBits. (2018). Shapes of Distributions. Retrieved from
https://mathbitsnotebook.com/Algebra1/StatisticsData/STShapes.html

Pennsylvania State University. (2018). Basic Terminology. Retrieved from
https://onlinecourses.science.psu.edu/statprogram/reviews/statistical-concepts/terminology

Rehill, GS. (n.d.). Mean, Median and Mode. Retrieved from
https://www.mathsteacher.com.au/year8/ch17_stat/02_mean/mean.htm

ThoughtCo. (n.d.). How to Calculate a Sample Standard Deviation. Retrieved
from https://www.thoughtco.com/calculate-a-sample-standard-deviation-3126345

Weiss, N. (2017). Introductory Statistics. England: Pearson Education Ltd.


Topic 3: Probability

In this topic, we will learn some basic concepts of probability. This will enable you
to be familiar with the meaning of probability and equip you for more challenging
concepts of probability distributions in later topics.

Learning Outcomes

The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:

1. Compute probability using basic rules and visualization tools
2. Differentiate the different types of events
3. Compute probability using the General Addition Rule
4. Compute conditional probability
5. Explain the concept of mutually exclusive and independent events


3.1 Basic Probability Concepts

Probability is derived from the word probably. For example, if you ask your friend
whether he is going to the party tonight, he may answer ‘probably’ or ‘maybe’. If
you ask further what the chances are that he will be there, he may say a 70%
chance. In terms of probability, this is 0.7; just divide the percentage
by 100.

Probability is the measure of how likely an event is to occur (Berenson, 2015).
Probability is always between 0 and 1 inclusive. The higher the probability of
an event, the more certain we are that the event will occur.

A probability of zero means that the event will definitely not happen. For example,
if we roll a die and want to obtain a ‘7’, this is not possible and
hence the probability is zero.

A probability of one means the event will surely happen. For example, suppose we
roll a die and want the number that appears to be less than 7. As we know, a
normal die only has the numbers 1 to 6. Therefore, we are sure the
number appearing will be less than 7 and hence this event occurs with a
probability of 1.

Recall that in Topic 1 we discussed the two types of variables (Qualitative
and Quantitative). In this topic, we will be introducing a few generic
formulas and probability principles which apply to all situations. Nevertheless, in
our examples, we will apply them specifically to categorical variables. We will
discuss probability for discrete and continuous variables in subsequent topics.


3.1.1 Basic Probability Formula

$$P(E) = \frac{n(E)}{\text{Total}}$$

An event is a set of outcomes which are of interest to us. When all outcomes are
equally likely, the probability of an event is the number of outcomes in the event
divided by the total number of outcomes (Pishro-Nik, 2014).

We usually use the notation ‘E’ to represent the event of interest. The upper case
‘P’ stands for probability. For example, P(E) will stand for the probability of the
event E occurring. The notation ‘n(E)’ stands for the number of outcomes in the
event E. The denominator ‘Total’ in the above formula refers to the total number
of outcomes.

Example:

Find the probability that today is a rainy day:

                No. of Days
Rainy Days          148
Fine Days           212
Total               360

$$P(\text{Rainy}) = \frac{n(\text{Rainy})}{\text{Total}} = \frac{148}{360} = \frac{37}{90}$$

In this example, the event is 'today is a rainy day'. We observe that there were
148 rainy days out of a total of 360 days. Therefore, we can apply the
probability formula to obtain the probability of rain.
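The basic formula can be sketched in Python with exact fractions, so that the simplified answer 37/90 appears directly. The function name `probability` is hypothetical; `fractions.Fraction` is part of the Python standard library.

```python
from fractions import Fraction


def probability(n_event, total):
    """P(E) = n(E) / Total, kept as an exact fraction."""
    return Fraction(n_event, total)


p_rain = probability(148, 360)
p_fine = probability(212, 360)

print(p_rain, round(float(p_rain), 3))
print(p_fine)
```

Since every day in the table is either rainy or fine, the two probabilities add up to 1.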


Basic Probability Formula
https://www.youtube.com/watch?v=WqTioYM0L7I

3.1.2 Types of Events

We shall now introduce the three basic types of events that we generally
observe in the determination of probability (Pennsylvania State University, 2018).

A Simple Event is an event that has a single characteristic.
• For example, in selecting a card, the event ‘selecting a black card’ is a
simple event. There is only one criterion: black.

A Joint Event requires two or more characteristics to happen concurrently. This
is usually denoted by the conjunction ‘and’.
• For example, the event ‘selecting an Ace and Black card’ is a joint event.

A Complement of an Event is the set of all outcomes that are not included in
the outcomes of the event. The complement of an event A is denoted by A’. A
simple way to consider the complement of an event is to negate the event.
• For example, the complement of ‘selecting a diamond card’ is NOT
selecting a diamond card (i.e. selecting a heart, spade or club).


3.2 Visualize Events

We will introduce the following three common tools that are used to visualise
events:

• Contingency Table
• Tree Diagram
• Venn Diagram

These tools help you to appreciate the situation by displaying the events in
a pictorial form. With their assistance, we can determine the desired probability
much more easily.

3.2.1 Contingency Table

                       Marital Status
                   Single   Married   Total
Gender   Male        65        25       90
         Female      48        72      120
         Total      113        97      210

A Contingency Table is a table in matrix format that displays the
frequency distribution of two categorical variables (Kling, n.d.).

In the above example, we have used the contingency table to display the joint
frequencies of Gender and Marital Status. From the table, we
observe the following:

• Number of Male and Single: 65
• Number of Male: 90
• Number of Single: 113
• Total Number of Observations: 210
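The counts read off the table can be turned into probabilities with the basic probability formula. The sketch below is only an illustration: the dictionary layout and the helper name `p` are choices made for this example, not part of the guide.

```python
from fractions import Fraction

# Joint frequencies from the Gender / Marital Status contingency table.
table = {
    ("Male", "Single"): 65, ("Male", "Married"): 25,
    ("Female", "Single"): 48, ("Female", "Married"): 72,
}
total = sum(table.values())


def p(gender=None, status=None):
    """Probability of the cells matching the given gender and/or status."""
    count = sum(n for (g, s), n in table.items()
                if (gender is None or g == gender)
                and (status is None or s == status))
    return Fraction(count, total)


print(p("Male", "Single"))     # joint event: Male and Single
print(p(gender="Male"))        # simple event: Male
print(p(status="Single"))      # simple event: Single
```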


Example:

Find the:

a) P(Pass Stats and Male)

b) P(Pass Stats)

a)

b)


Jane has 10 pairs of shoes:

          New   Old   Total
Black      3     2      5
White      1     1      2
Blue       2     1      3
Total      6     4     10

Calculate
1. P(new blue shoe)
2. P(old shoe)
3. P(not black shoe)


3.2.2 Tree Diagram

The second visualization tool is the Tree Diagram. A tree diagram is a
representation of a tree structure, a way of representing the hierarchical nature of
a structure of events in a graphical form (Statistics How To, 2018).

In the above example, we displayed at the first level the outcomes of your QA
course: Pass or Fail. These events are represented by two branches starting
from a common point. We displayed the probability of each outcome on
the respective branch. Note that the probabilities on these branches add up to 1.
This will always be so, since the total probability is always 1.

The second level branches out further from the first-level outcomes. If you pass
QA, you could go to a party or on a holiday. If you fail QA, you could retake the
module or quit the course. Note again that we indicated the probabilities on the
branches.

In this way, if we are interested in the event ‘Pass and Party’, we can trace the
desired branch accordingly. Usually, we make use of the tree diagram to display
conditional probabilities. We will discuss this in a later part of this topic.


3.2.3 Venn Diagram

The third visualization tool is the Venn Diagram. A Venn diagram is a diagram that
shows all possible logical relations between events (Lucidchart, 2018). In the
above example, Event A is represented by the smaller circle while Event B is
represented by the bigger circle. The event ‘A and B’ is represented by the
overlapping region and the event ‘A or B’ is represented by the total region within
the two circles.

Venn Diagram
https://www.youtube.com/watch?v=b6t0994ZZDA


If P(A) = 0.3, P(B) = 0.5 and P(A and B) = 0.1, use a Venn diagram to find

a) P(A and B’)

b) P(A or B)


3.3 General Addition Rule

So far, we have learnt only one formula, which can be used to determine the
probabilities of simple events and joint events. We shall now discuss a second
formula, known as the General Addition Rule (Weiss, 2017):

P(A or B) = P(A) + P(B) - P(A and B)

This formula determines the probability of the event A or B. As discussed
previously using the Venn diagram, the event ‘A or B’ includes outcomes in A
only, B only and also the common region A and B. Adding P(A) and P(B) counts
the common region twice, so P(A and B) is subtracted once.

As a general approach, whenever we need to find the probability of A or B, we
can apply the General Addition Rule directly. The problem is then reduced to
finding the probabilities of two simple events (i.e. event A, event B) and
one joint event (i.e. event A and B). We can then revert to the first formula
for this purpose.


Example:

P(Pass Stats or Female)

= P(Pass) + P(Female) - P(Pass and Female)

= 84/108 + 49/108 - 36/108 = 97/108
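
The arithmetic in this example can be checked with exact fractions. The sketch below uses only the three counts quoted in the worked example (84, 49 and 36 out of 108); the variable names are illustrative.

```python
from fractions import Fraction

total = 108  # total number of students in the worked example

p_pass = Fraction(84, total)
p_female = Fraction(49, total)
p_pass_and_female = Fraction(36, total)

# General Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)
p_pass_or_female = p_pass + p_female - p_pass_and_female
```

Using `Fraction` avoids rounding, so the result can be compared directly with the hand-calculated 97/108.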


          New   Old   Total
Black      3     2      5
White      1     1      2
Blue       2     1      3
Total      6     4     10

Find P(Black or Old).


3.4 Mutually Exclusive Events

Previously, we discussed three basic types of events - Simple event, Joint
event and Complementary event. We shall now discuss one more type of event,
known as Mutually Exclusive Events.

Events A and B are said to be mutually exclusive if, when event A occurs,
event B cannot happen and, vice versa, when event B happens, event A cannot
happen (Varadhan, 2001).

Consider the following example:

• Event A: Draw a Queen of Hearts from a deck of cards
• Event B: Draw a Queen of Spades from a deck of cards

Each event can happen by itself. However, it is not possible for a single draw
to produce a Queen of Hearts and a Queen of Spades at the same time. Therefore,
these two events are mutually exclusive.

If A and B are mutually exclusive, then

P(A and B) = 0

This follows directly from the definition: since two mutually exclusive events
cannot happen together, their joint probability is zero.

Hence, using the General Addition Rule, if events A and B are mutually exclusive,
P(A or B) = P(A) + P(B).
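
As a small check of this special case, the sketch below applies the General Addition Rule to the two card events. It assumes a standard 52-card deck and a single draw, which the example above implies but does not state.

```python
from fractions import Fraction

deck_size = 52  # assumption: a standard deck, one card drawn

p_queen_hearts = Fraction(1, deck_size)
p_queen_spades = Fraction(1, deck_size)
p_both = Fraction(0)  # mutually exclusive: one card cannot be both queens

# With P(A and B) = 0 the rule reduces to P(A) + P(B).
p_either = p_queen_hearts + p_queen_spades - p_both
```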


The following table shows the joint distribution for engineers and scientists by
highest degree obtained:

             Engineer   Scientist   Total
Bachelor        34          27        61
Master          19          12        31
Doctorate        3           5         8
Total           56          44       100

1. Determine P(Master and Engineer).

2. Determine P(Master or Engineer).

3. Are the events ‘Master’ and ‘Engineer’ mutually exclusive? Why?


3.5 Conditional Probability

Before we discuss the formula for conditional probability, it is important to fully
appreciate the Concept of Conditional Probability. Otherwise, you may have
difficulty recognising the situations in which conditional probabilities are
appropriate.

A conditional probability is the probability that an event will occur when another
event is known to occur or to have already occurred (Berenson, 2015). Consider
two events A and B. If we know that event B has already occurred and we want
to find the probability of event A, we require the conditional probability of A
given that B has occurred. The symbol we use to represent this is P(A|B). We
read this as ‘probability of A given B’.

Consider the following example:

Event A: You will come to school today
Event B: Today is your QA exam

The probability of event A can be easily determined by looking at your
attendance record. Let’s say you attend classes 80% of the time; then P(A) = 0.8.
This is the probability of a simple event, which we covered previously.

However, imagine we know that today is your QA exam. That is, we know that
event B is sure to happen (a given condition). Therefore, the probability of 'You
will come to school' (event A) given that 'Today is your QA exam' (event B) is
going to change. This probability is a conditional probability -- P(A|B).

P(A | B) = P(A and B) / P(B)

The Formula for Conditional Probability turns out to be quite easy. It is just the
joint probability of event A and B divided by the probability of event B. As
mentioned earlier, the challenge is not in applying the formula but in recognizing
the situations where conditional probability is required.


Example:

It is noted that among all Kaplan Students, 70% are foreign students, 40% are
male and 20% are both. What is the probability that a student is male, given that
he is a foreign student?

We note that:

P(Foreign) = 0.7, P(Male) = 0.4 & P(Foreign & Male) = 0.2

P(Male | Foreign) = P(Male and Foreign) / P(Foreign) = 0.2 / 0.7 = 2/7
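
The same calculation can be written as a small helper. This is a minimal sketch: the function name `conditional` is an illustrative choice, and the inputs are the probabilities given in the example (0.2 and 0.7), expressed as exact fractions.

```python
from fractions import Fraction


def conditional(p_a_and_b, p_b):
    """Return P(A | B) = P(A and B) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be positive to condition on B")
    return p_a_and_b / p_b


p_male_and_foreign = Fraction(2, 10)
p_foreign = Fraction(7, 10)

p_male_given_foreign = conditional(p_male_and_foreign, p_foreign)
```

Note that P(Male) = 0.4 is not needed here: once we know the student is foreign, only the foreign students matter.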

Probability and Contingency Table
https://www.youtube.com/watch?v=AELR4O5RVl4


                        Purchased   Purchased      Total
                        iPhone      Other Phone
Download Games              38          42           80
Never download Games        70         150          220
Total                      108         192          300

If a student who purchased an iPhone is randomly selected, what is the chance that
the student downloads games?


3.6 Multiplication Rule

Using simple algebraic manipulation, we can rewrite the conditional probability
formula as follows:

P(A and B) = P(A|B) P(B)

In other words, the joint probability of A and B can be written as the product
of the conditional probability P(A|B) and the probability P(B). We refer to the
above expression as the Multiplication Rule (Berenson, 2015).

Let’s revisit the Tree Diagram we learnt earlier. When dealing with conditional
probabilities, the first level of the tree shows the probabilities of simple events.
However, when we extend to the second level, we need to show the conditional
probabilities.

In the above diagram, the first branch breaks out to events B and B’. P(B) and
P(B’) are stated on the branches. Following from each branch (i.e. B and B’), we
extended it to events A and A’. Notice that in the second level, we indicated
respective conditional probabilities such as P(A|B), P(A’|B), etc.

To work out the joint probability of A and B, we just trace the appropriate branch
and multiply the probabilities along the path. i.e. P(A and B) = P(B) x P(A|B).
Similarly, P(B’ and A) = P(B’) x P(A|B’).

You may have realised by now that we are actually using the multiplication rule
by doing the above. Therefore, the tree diagram is actually an easy way to apply
multiplication rule using visualization!
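
The path-tracing idea can be sketched in code. The probabilities below (P(B) = 3/5, P(A|B) = 1/2, P(A|B’) = 1/4) are made-up values chosen only for illustration; they do not come from the diagram in the study guide.

```python
from fractions import Fraction

# Hypothetical branch probabilities, for illustration only.
p_b = Fraction(3, 5)
first_level = {"B": p_b, "B'": 1 - p_b}
p_a_given = {"B": Fraction(1, 2), "B'": Fraction(1, 4)}

# Multiply along each path: P(first and second) = P(first) x P(second | first)
paths = {}
for b_event, p_first in first_level.items():
    p_a = p_a_given[b_event]
    paths[(b_event, "A")] = p_first * p_a
    paths[(b_event, "A'")] = p_first * (1 - p_a)

# A can be reached through B or through B', so its probability is the sum
# of the two paths that end in A.
p_a_total = paths[("B", "A")] + paths[("B'", "A")]
```

The four path probabilities always add up to 1, which is a useful check after completing a tree.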


On his way to the office, Mr Tan walks past a newspaper store every morning. He is
likely to buy the Straits Times or the Business Times (not both), but on some days
he does not buy any paper. The chance that he buys the Straits Times is 0.5 and
the chance that he buys the Business Times is 0.3. If he bought the Straits Times,
there is a 70% chance that he brings it home in the evening. If he bought the
Business Times, the chance is 20%.

Using a tree diagram, find the probability that Mr Tan does not bring any
newspaper home tonight.


3.7 Independent Events

Two events are Independent if the occurrence of one event does not change the
probability of the other occurring (Weiss, 2017). In other words, the two events
are not related.

For example, let event A represent obtaining a 2 when rolling a die and event B
represent obtaining a Head when flipping a coin. We say that these two events
are independent since rolling a 2 does not affect the probability of flipping a head
and vice versa.

We can present the Definition of Independence mathematically by stating

P(A|B) = P(A).

The expression states that the probability of A happening, given that B has
happened, remains unchanged. That is, the occurrence of B has no effect on the
probability of A, which is exactly the meaning of independence.

Recall that we have learnt earlier the multiplication rule:

P(A and B) = P(A|B) P(B).

If A and B are independent, then by definition, P(A|B) = P(A). Substituting this
into the multiplication rule gives the joint probability for independent events,
namely

P(A and B) = P(A) x P(B) when A and B are independent.
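
Both forms of the independence condition can be verified by listing every outcome of the die and coin example. The sketch assumes a fair six-sided die and a fair coin, so that all 12 combined outcomes are equally likely; the helper `prob` is an illustrative name.

```python
from fractions import Fraction
from itertools import product

# All combined outcomes of one die roll and one coin flip (assumed fair).
outcomes = list(product(range(1, 7), ["H", "T"]))


def prob(event):
    """Basic probability formula: favourable outcomes / all outcomes."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))


def is_two(outcome):
    return outcome[0] == 2


def is_head(outcome):
    return outcome[1] == "H"


p_a = prob(is_two)
p_b = prob(is_head)
p_a_and_b = prob(lambda o: is_two(o) and is_head(o))
p_a_given_b = p_a_and_b / p_b  # conditional probability formula
```

Here P(A|B) equals P(A), and P(A and B) equals P(A) x P(B), so either check confirms independence.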

Probability of Independent and Dependent Events
https://www.youtube.com/watch?v=jos1yBC_L8E


1. A female student has just sat for the Statistics exam. What is the
probability that she will fail?

2. Are the events Male and Pass Stat independent?


Summary

Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.

• Basic Probability Formula

• Types of Events

• Visualizing Events

• General Addition Rule

• Conditional Probability

• Multiplication Rule and Tree Diagram

• Mutually Exclusive & Independent Events


REFERENCES

Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.

Kling, A. (n.d.). Contingency Tables. Retrieved from
http://arnoldkling.com/apstats/contingency.html

Lucidchart. (2018). What is a Venn Diagram. Retrieved from
https://www.lucidchart.com/pages/venn-diagram

Pennsylvania State University. (2018). Basic Terminology. Retrieved from
https://onlinecourses.science.psu.edu/statprogram/reviews/statistical-concepts/terminology

Pishro-Nik, H. (2014). Introduction to Probability, Statistics, and Random
Processes. United States: Kappa Research, LLC.

Statistics How To. (2018). Probability Tree Diagrams. Retrieved from
http://www.statisticshowto.com/how-to-use-a-probability-tree-for-probability-questions/

Varadhan, S. (2001). Probability Theory. United States: American Mathematical
Society.

Weiss, N. (2017). Introductory Statistics. England: Pearson Education Ltd.


Topic 4: Discrete Probability Distributions

In this topic, we shall examine the Probability Distributions of Discrete Variables.
We will first introduce some basic concepts relating to discrete probability
distributions. Thereafter, we will discuss two important distributions – Binomial
Distribution and Poisson Distribution.

Learning Outcomes

The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:

1. Compute probability, mean and standard deviation of a general discrete
   probability distribution.
2. Compute probability, mean and standard deviation of a Binomial Distribution.
3. Compute probability, mean and standard deviation of a Poisson Distribution.


4.1 General Discrete Probability Distributions

You may recall that there are two types of variables – Qualitative (categorical)
and Quantitative (numerical). For quantitative variables, we can further divide
them into Discrete and Continuous variables. For this topic, we are dealing with
probabilities of discrete variables.

Firstly, we will discuss the concept of Probability Distribution and how we could
construct this for discrete variables. Thereafter, we will compute the mean and
standard deviation for discrete probability distributions.

A Discrete Random Variable is a variable that counts the number of desired
observations (Berenson, 2015). For example, suppose we flip a coin five times and
let X be the number of times ‘head’ appears. X is considered a discrete random
variable because:
• It is a variable since X can take varying values (0, 1, 2, 3, 4 or 5).
• It is random since the value of X for each experiment is not predictable.
• It is discrete since X can only take fixed numbers that are not connected.


Number of Modules Taken    Probability
          2                   0.2
          3                   0.4
          4                   0.24
          5                   0.16

The above table shows the Probability Distribution of a discrete variable. Here,
the variable of interest is ‘Number of Modules Taken’. Note that this is a discrete
variable since the possible values are 2, 3, 4 or 5 (fixed numbers and not
connected).

The column on the right shows the probability for each possible value. Sum up
the probabilities in the right column and you should get a total probability of 1.
The table presents the distribution of the probabilities for a discrete variable and
it is therefore known as a discrete probability distribution.

To Construct a Discrete Probability Distribution, we begin by listing all
the possible outcomes for the variable. For example, if we flip a coin twice and
let X represent the number of times ‘head’ appears, then X can take the
values 0, 1 and 2. We list these values in the left column as shown in the
following table:


Next, for each value of X, we compute the respective probability using our basic
probability formula. Make sure that you do a final check by adding up all the
probabilities and it should be equal to one. So, we have managed to present the
discrete probability distribution in a table form.

x    Probability
0    0.25
1    0.5
2    0.25
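
The table above can be constructed by enumeration. The sketch below assumes a fair coin, so that the four sequences HH, HT, TH and TT are equally likely, and counts the heads in each one.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Step 1: list every possible outcome of two flips (assumed equally likely).
flips = list(product("HT", repeat=2))

# Step 2: count how many outcomes give each value of X (number of heads).
head_counts = Counter(flip.count("H") for flip in flips)

# Step 3: basic probability formula for each value of X.
distribution = {
    x: Fraction(count, len(flips)) for x, count in sorted(head_counts.items())
}
```

The final check from the text applies here too: the probabilities in `distribution` add up to 1.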

Take a look at the chart presented below:

Did you notice that this chart shows exactly the same information presented in
the probability distribution table? We can determine the probability of each value
of X by reading the chart. Hence, the above chart is also a probability
distribution.

Besides presenting discrete probability distributions in tables and charts, another
common way is to present the distribution as a function P(x) – a formula in x. To
find the probability for each value of X, we substitute the respective value into the
function. We shall examine this in greater detail later.

Therefore, discrete probability distributions can appear in three forms – table,
chart or function.

Discrete Probability Distribution
https://www.youtube.com/watch?v=bxrudsvTUsg&list=PLMbMBtorF9o4YdkeimsqS8lbP7nZcQp0Y


Once we have constructed the probability distribution, we can compute the Mean
for a Discrete Probability Distribution, using the following formula. The mean
is also known as the Expected Value (nzmaths, n.d.).

$$\mu = E(X) = \sum_{i=1}^{N} X_i \, P(X_i)$$

Example:

x    P(x)
0    0.25
1    0.5
2    0.25

µ = 0 x 0.25 + 1 x 0.50 + 2 x 0.25 = 1

The population mean is calculated by multiplying each value of X by its
respective probability and summing the products.

Note that we have used the symbol µ, the population mean, to represent the mean.
This is because a probability distribution always describes the population, and
hence its mean is always the population mean.

What exactly does the mean of a discrete probability distribution indicate?
Consider the situation where we flip a dented coin twice and we let X be the
number of heads. The values of X could be 0, 1 or 2 in this case. Let's assume
that the probability distribution of X is as shown in the above example.

We can repeat this experiment many times (say 1 million times) and note down
the value of X for each experiment. Take the average of these values and the
answer is likely to be very close to the mean we found above (i.e. 1) since we
have a very large sample. Theoretically, if we repeat the experiment infinitely
many times (i.e. we get the population for X), the average value of X will be
exactly 1.
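
This long-run interpretation can be tried out with a small simulation. The sketch below draws values of X according to the probabilities listed in the example table (0.25, 0.5 and 0.25, for which the products in the mean formula sum to 1) and compares the sample average with the expected value. The number of repetitions and the fixed seed are arbitrary choices made so that the run is quick and repeatable.

```python
import random

values = [0, 1, 2]
probabilities = [0.25, 0.5, 0.25]  # P(x) column of the example table

# Expected value from the formula: sum of x * P(x).
expected_value = sum(x * p for x, p in zip(values, probabilities))

random.seed(0)  # fixed seed so the simulation is repeatable
n_experiments = 200_000
samples = random.choices(values, weights=probabilities, k=n_experiments)

# Average of the simulated values of X.
sample_average = sum(samples) / n_experiments
```

The sample average will not match the expected value exactly, but the gap tends to shrink as the number of repetitions grows.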


Once we have determined the mean, we can use the following formulas to
compute the Variance and Standard Deviation of Discrete Probability
Distributions (Morganstern, 2013). Recall from Topic 2 that whenever we want to
determine the standard deviation, we need to find the variance first and then take
its square root.

• Variance of a discrete random variable

$$\sigma^2 = \sum_{i=1}^{N} [X_i - \mu]^2 \, P(X_i)$$

• Standard Deviation of a discrete random variable

$$\sigma = \sqrt{\text{variance}}$$

Example:
x    P(x)
0    0.25
1    0.5
2    0.25

σ² = (0 - 1)² x 0.25 + (1 - 1)² x 0.50 + (2 - 1)² x 0.25 = 0.5

σ = √0.5 ≈ 0.707
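
The mean and variance formulas translate directly into code. The sketch below is one possible helper (the name `mean_variance` is illustrative) applied to the P(x) values listed in the example table.

```python
from math import sqrt


def mean_variance(distribution):
    """Return (mean, variance) of a discrete distribution given as {x: P(x)}."""
    mu = sum(x * p for x, p in distribution.items())
    variance = sum((x - mu) ** 2 * p for x, p in distribution.items())
    return mu, variance


table = {0: 0.25, 1: 0.5, 2: 0.25}  # x and P(x) from the example table

mu, variance = mean_variance(table)
sigma = sqrt(variance)  # standard deviation is the square root of the variance
```

The mean must be computed first because every term of the variance measures the squared distance of a value from that mean.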


x    P(x)
0    0.40
1    0.30
2    0.10
3    0.15
4    0.05

• Find the mean.

• Find the variance and standard deviation.


4.2 Binomial Distributions

We have discussed the concepts of probability distribution, mean and standard
deviation for discrete probability distributions. In most business applications, we
do not construct the probability distribution from scratch. Instead, we make use of
some common Probability Distributions for Discrete Variables (Gordon, 1997).
Here are some examples:
• Binomial Distribution
• Poisson Distribution
• Geometric Distribution
• Discrete Uniform Distribution

To use a known probability distribution model, we first have to ensure that the
situation fits the conditions of the model. Thereafter, we can make use of the
known results (formulas), such as the probability distribution function, mean and
standard deviation, that describe the model. In this section, we will examine one
very important model - Binomial Distribution (Weiss, 2017).

There are four main conditions in a Binomial model:

1) In a Binomial experiment, each observation is categorised into two
   outcomes. The outcome that we are interested in is regarded as ‘success’
   and its complement is regarded as ‘failure’.

2) The Binomial experiment has a fixed number of observations
   (denoted by n).

3) For each observation, the probability of getting the outcome that we are
   interested in (success) should be the same (denoted by π).

4) In addition, each observation must be independent of the others.


Here are some common business applications that can be fitted into a Binomial
model. In general, whenever we want to determine the number of occurrences of
a certain characteristic (i.e. success) when there is a fixed number of
observations (i.e. n), we can immediately consider the possibility of fitting the
situation into a Binomial model.

• The number of car accidents out of 1000 cars

• The number of flu cases out of 50 patients

• The number of manufacturing defects out of 10,000 pieces

• The number of people who respond to a charity show out of a population of
  5 million

Binomial Model
https://www.youtube.com/watch?v=59v5aZ8NMpk

Once we have determined that we can use the Binomial model, we can define X
by using the following prescribed format:

X: Number of <<success>> out of <<total >>


(X = 0,1, 2, . . . n)

For example, X could represent ‘Number of heads out of 20 tosses’.


The probability function for the Binomial model is as follows:

$$P(x) = \frac{n!}{x!\,(n-x)!} \, \pi^{x} (1-\pi)^{n-x}$$

where

n: number of trials

π: probability of success in each trial

x: number of successes (x = 0, 1, 2, . . . , n)

P(x) is known as the probability function. In this case, since the formula is
applicable to the Binomial model, the above formula is called the Binomial
probability function or, in brief, the Binomial Distribution. By substituting the
desired value of x into P(x), we can determine the probability of x successes
occurring in n observations.
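
As an illustration of substituting values into the Binomial probability function, the sketch below uses the standard formula P(x) = C(n, x) π^x (1 − π)^(n − x). It takes the earlier wording ‘Number of heads out of 20 tosses’ and assumes a fair coin (π = 0.5), which is an assumption made here and not stated in the study guide.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


n, pi = 20, 0.5  # 20 tosses; pi = 0.5 is an assumed fair coin

p_ten_heads = binomial_pmf(10, n, pi)

# The probabilities for x = 0, 1, ..., n should add up to 1.
total = sum(binomial_pmf(x, n, pi) for x in range(n + 1))
```

`math.comb(n, x)` computes n! / (x! (n − x)!), the number of ways to place x successes among n trials.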


a) Suppose we toss a dented coin 10 times and the chance that the coin
shows a Head in each throw is 10%. Is this a Binomial model? What are n and π?

n= π =

b) Substitute the values of n and π into the Binomial distribution:

P(x) =

c) If we are interested in the probability of obtaining 3 Heads, then X = 3.
Substitute the value of X into the Binomial distribution and determine the
probability:

P(3) =

Binomial Distribution
LMS Learning Outcome 4.1
https://elearn-diploma.kaplan.com.sg


Suppose there is a 2% chance of purchasing a defective computer. What is the
probability of purchasing 2 defective computers in a group of 10?

Let X: Number of ______________ out of ____________

n= π= X=

P(x) =


Note that:

• P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2)

• P(X = 0) + P(X = 1) + P(X = 2) + . . . + P(X = n) = 1, so

P(X > 2) = 1 - P(X ≤ 2)
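
These two notes can be sketched as follows. The values n = 5 and π = 0.5 are hypothetical and were picked only because they give round answers; the helper names are illustrative.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


def binomial_cdf(k, n, pi):
    """P(X <= k): add the point probabilities for x = 0, 1, ..., k."""
    return sum(binomial_pmf(x, n, pi) for x in range(k + 1))


n, pi = 5, 0.5  # hypothetical values, for illustration only

p_at_most_2 = binomial_cdf(2, n, pi)       # P(X <= 2)
p_more_than_2 = 1 - p_at_most_2            # P(X > 2), complement of "at most 2"
p_at_least_2 = 1 - binomial_cdf(1, n, pi)  # P(X >= 2), complement of "at most 1"
```

‘At most 2’ means X ≤ 2, while ‘at least 2’ means X ≥ 2, whose complement is X ≤ 1 and not X ≤ 2.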

If n = 10, π = 0.1, find

a) P(X = 2)

b) P(X ≤ 2)

c) P(X > 2)

d) Probability of at most 2

e) Probability of at least 2


We have observed that once we can fit a situation into a Binomial model,
we can immediately use the Binomial probability function to determine the
desired probabilities. Besides this, we can also determine the Binomial Mean
and Binomial Standard Deviation (Weiss, 2017) by using the following formulas:

$$\mu = n\pi$$

$$\sigma = \sqrt{n\pi(1-\pi)}$$

Binomial Mean and Standard Deviation
LMS Learning Outcome 4.2
https://elearn-diploma.kaplan.com.sg

Suppose there is a 2% chance of purchasing a defective computer. If 10
computers were purchased, what are the mean, variance and standard deviation
for the number of defective computers?

n= π=

Mean =

Variance =

Standard Deviation =


A company produces the latest model of a camera in large quantities. From past
records, the defective rate is known to be 3%. Every day, the company samples
a batch of 50 cameras to check for manufacturing errors.

Calculate the

a) probability that no camera is defective,

b) probability that more than 2 cameras are defective,

c) mean and standard deviation for the number of defective cameras


4.3 Poisson Distributions

In the previous sections, we discussed the concept of probability distributions of
discrete variables. We have also examined the Binomial distribution and its
applications. In this section, we will discuss another discrete probability
distribution known as the Poisson Distribution (Sharma, 2014).

These are the conditions of a Poisson model:

• Each observation has only 2 outcomes – success or failure
    - e.g. car accident, obtaining a head, typo error

• A period of observations
    - e.g. in 1 day, 5 minutes, within 1 page

• The average number of “success” within the “period” is λ (lambda)
    - e.g. an average of 2 car accidents in a day

• Observations are independent
    - The outcome of one observation does not affect the others

Note that the Binomial and Poisson models share some similarities. Both models
are for discrete variables and both are used to count the number of desired
observations (i.e. successes).

The difference lies in when the experiment ends. A Binomial experiment ends
after a fixed number of observations (i.e. n). A Poisson experiment, on the other
hand, ends after a time interval (i.e. period).


Here are some common business applications that can be fitted into a Poisson
model:

• The number of cups of coffee sold in 1 hour

• The number of flu cases in 1 day

• The number of calls in 3 hours

• The number of typo errors in one page

In general, whenever we want to determine the number of occurrences of a
certain characteristic (i.e. success) over a time interval (i.e. period), we can
immediately consider the possibility of fitting the situation into a Poisson model.

Apart from counting the number of successes over a time interval, the Poisson
experiment also applies to a region of space, for example over a page or over a
length of road.

Once we have determined that we can use the Poisson model, we can define X
by using the following prescribed format:

X: Number of <<success>> within <<period>>


(X = 0,1, 2, . . . )


The probability function for the Poisson model is as follows:

$$P(x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$

where

λ: average number of successes within a period

P(x) is known as the probability function. In this case, since the formula is
applicable to the Poisson model, the above formula is called the Poisson
probability function or, in brief, the Poisson Distribution. By substituting the
desired value of x into P(x), we can determine the probability of x successes
occurring within the given period.
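
A minimal sketch of the Poisson probability function is shown below. It borrows λ = 2 from the earlier illustration of ‘an average of 2 car accidents in a day’; the function name `poisson_pmf` is an illustrative choice.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x successes when the average per period is lam."""
    return exp(-lam) * lam**x / factorial(x)


lam = 2  # average of 2 car accidents in a day

p_none = poisson_pmf(0, lam)   # no accident in a day
p_three = poisson_pmf(3, lam)  # exactly 3 accidents in a day

# X has no upper limit, but the probabilities beyond the first few dozen
# values are negligible, so a truncated sum is already very close to 1.
total = sum(poisson_pmf(x, lam) for x in range(60))
```

Unlike the Binomial case there is no n here: the only input besides x is the average λ for the period.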

a) Suppose λ = 0.5. Substitute this value into the Poisson distribution:

$$P(x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$

P(x) =

b) If we are interested in the probability of obtaining 3 successes, then X = 3.
Substitute the value of X into the Poisson distribution and determine the
probability:

P(3) =


If λ = 1.2, find

a) P(X = 4)

b) P(X < 2)

c) P(X ≥ 2)

d) Probability of at most 1

e) Probability of at least 2


We have observed that once we can fit a situation into a Poisson model,
we can use the Poisson probability function to determine the desired probabilities.
Besides this, we can also determine the Poisson Mean and Poisson Standard
Deviation (Sharma, 2014) as follows:

$$\mu = \lambda$$

$$\sigma = \sqrt{\lambda}$$

Poisson Distribution
LMS Learning Outcome 4.3
https://elearn-diploma.kaplan.com.sg


From past records, the owner of a fast food restaurant knows that, on average,
8.3 cars use the drive-through windows in 1 hour. Assuming that this event
follows a Poisson probability distribution, calculate the

a) probability that 10 cars use the drive-through windows within an hour,

b) probability that 20 cars use the drive-through windows within two hours,

c) mean and standard deviation for the number of cars that use the
drive-through windows per hour


Summary

Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.

• Construct a Discrete Probability Distribution

• Mean and Standard Deviation of Discrete Probability Distribution

• Binomial Probability Distribution

• Binomial Mean and Standard Deviation

• Poisson Probability Distribution

• Poisson Mean and Standard Deviation


REFERENCES

Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.

Gordon, H. (1997). Discrete Probability (Undergraduate Texts in Mathematics).
USA: Springer.

Morganstern, R. (2013). Discrete Probability. United Kingdom: Createspace
Independent Publishing Platform.

nzmaths. (n.d.). Expected value (of a discrete random variable). Retrieved from
https://nzmaths.co.nz/category/glossary/expected-value-discrete-random-variable

Sharma, JK. (2014). Business Statistics. India: Vikas Publishing House Pvt Ltd.

Weiss, N. (2017). Introductory Statistics. England: Pearson Education Ltd.


Topic 5: Normal Distributions

In the previous topic, we discussed the concept of probability distributions of
discrete variables. In particular, we have examined two important distributions –
Binomial and Poisson distributions.

In this topic, we will introduce the concept of Probability Distributions for
Continuous Variables. We will first introduce some basic concepts relating to
continuous probability distributions. Thereafter, we will discuss an important
distribution – Normal Distribution. This distribution is very important as it is the
foundation for inferential statistics.

Learning Outcomes

The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:

1. Explain the differences between discrete and continuous probability
   distributions

2. Compute the probability of standard normal distribution

3. Compute the probability of any normal distribution


5.1 Continuous Probability Distributions

You may recall that there are two types of variables – Qualitative (categorical)
and Quantitative (numerical). For quantitative variables, we can further divide
them into Discrete and Continuous variables. For this topic, we are dealing with
probabilities of continuous variables.

Firstly, we will discuss the general approach to determine probability from a
Continuous Probability Distribution. Along the way, we will also attempt to
highlight the differences in approaches between discrete and continuous
probability distributions.

Recall that when we deal with probabilities for discrete variables, we would need
to construct probability distributions. In Binomial and Poisson models, we actually
have formulas for probability distributions. The symbol we used for probability
distribution is P(x). This is also known as a probability mass function.


For continuous variables, we will not have probability mass functions P(x).
Instead, we will be using a Probability Density Function, denoted by f(x), to
represent the distribution (Mejlbro, 2009). A probability density function is a
function that describes the relative likelihood for a random variable to take on a
given value. Depending on the formula for f(x), we can obtain various curves
when we plot f(x) against x. Here are some examples:

If X is a random variable taking values between 0 and 1, what do you think is the
probability of obtaining a single value (say X = 0.3) in this interval?

The answer for the above question is actually zero. There are infinitely many
points within any interval. As such, the probability of obtaining one single point
(out of infinitely many points) is zero. Therefore, there is no meaning in asking for
probability of one point for a continuous interval. In continuous probability
distributions, we will always compute probability within an interval.

For example:

• P(X < 3.8)

• P(X > 3.8)

• P(-0.2 < X < 5.6)


This is very different from the approach for discrete variables. In discrete
probability distributions, we always have to work out the probabilities one point at
a time. For example, in discrete probability distributions, if we want P(X ≤ 1), we
need to work out P(X = 0) and P(X = 1) individually before adding up the
probabilities.

You may recall that in discrete probability distributions, if we want to find the
probability of a point (say X = 3), we will substitute the value X = 3 into the
probability mass function to obtain the answer.

For continuous variables, the approach is different. To find the probability within
an interval, we need to find the area under the probability density function
within the interval. For example, to find P(a < X < b), we need to find the area
under the curve of the probability density function for X values between a and
b. The desired area is the shaded region:


Since area represents probability in continuous distributions, the total area under
the probability density function (i.e. the curve) is 1.
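
The idea that area represents probability can be checked numerically. The sketch below is only an illustration: the density f(x) = 2x on the interval from 0 to 1 is a hypothetical example chosen because its areas are easy to verify by hand, and the helper name area_under_pdf is our own.

```python
def f(x):
    """Hypothetical density (an assumption for illustration): f(x) = 2x on [0, 1]."""
    return 2 * x if 0 <= x <= 1 else 0.0


def area_under_pdf(pdf, a, b, steps=10_000):
    """Approximate the area under pdf between a and b with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (pdf(a) + pdf(b))
    for i in range(1, steps):
        total += pdf(a + i * width)
    return total * width


p_interval = area_under_pdf(f, 0.2, 0.5)   # P(0.2 < X < 0.5)
p_total = area_under_pdf(f, 0.0, 1.0)      # area under the whole curve
p_point = area_under_pdf(f, 0.3, 0.3)      # zero-width interval, i.e. P(X = 0.3)

print(round(p_interval, 4))   # 0.21
print(round(p_total, 4))      # 1.0
print(p_point)                # 0.0
```

The interval has area 0.5² – 0.2² = 0.21, the whole curve has area 1, and a single point has no width and therefore no area.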

Continuous Probability Distribution
https://www.youtube.com/watch?v=OWSOhpS00_s


5.2 Normal Distributions

We have discussed the concept of the probability density function and the approach
to finding probabilities for continuous variables. We shall now introduce a very
important continuous probability distribution known as the Normal Distribution
(Berenson, 2015).

The probability density function of a Normal Distribution, with mean µ and
standard deviation σ, is as follows:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

As you can see, the formula of the probability density function for the Normal
distribution is rather scary. Not to worry, as we have mentioned earlier, for
continuous variables, the formula is actually not important. What we need is
actually the curve of the probability density function.

The shape of the normal distribution resembles that of a bell, so it is sometimes
referred to as the "bell curve".

The bell curve has the following characteristics:
• It is symmetrical
• The mean, median and mode are equal
• It extends to positive and negative infinity
• The area under the curve is equal to 1

The normal distribution can be completely specified by two parameters:
• Mean (µ)
• Standard deviation (σ)


The Empirical Rule (TutorVista.com, n.d.) states that for a normal distribution,
approximately:
• 68% of the data will fall within 1 standard deviation of the mean
• 95% of the data will fall within 2 standard deviations of the mean
• Almost all (99.7%) of the data will fall within 3 standard deviations of the
  mean
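
The three percentages can be reproduced from the normal curve itself. The following is a minimal sketch using Python's standard-library NormalDist class; the mean of 50 and standard deviation of 10 are hypothetical values, and any other choice gives the same percentages.

```python
from statistics import NormalDist

# Hypothetical normal distribution (values chosen only for illustration).
mu, sigma = 50, 10
dist = NormalDist(mu, sigma)

coverage = {}
for k in (1, 2, 3):
    # Area under the curve between mu - k*sigma and mu + k*sigma.
    coverage[k] = dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)
    print(f"within {k} standard deviation(s): {coverage[k]:.4f}")
```

The printed areas are about 0.6827, 0.9545 and 0.9973, which the Empirical Rule rounds to 68%, 95% and 99.7%.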

The normal distribution is widely observed. Many continuous variables in
business and daily life (e.g. weight, salary, exam scores) behave approximately
as the empirical rule describes (Gremmell, 2016).

We mentioned that the normal distribution can be completely specified by the
population mean and standard deviation. As a concise expression, we usually
use the following notation to represent a normal distribution:

$X \sim N(\mu, \sigma^2)$

We read the above notation as 'X follows a normal distribution with mean equal
to µ and variance equal to σ²'. We will be using this notation frequently in the
next topic.

Normal Distribution
https://www.youtube.com/watch?v=iYiOVISWXS4

To find the area under the bell curve directly is not easy. We usually make use of
a statistical table to do so. A Statistical Table is a data sheet that provides pre-
computed values of areas under the curves for a desired probability density
function. In the case of normal distribution, we will be using a statistical table for
the standard normal distribution.


A Standard Normal Distribution is a normal distribution with mean equal to 0 and
standard deviation equal to 1 (Weiss, 2017). We commonly denote the standard
normal distribution by the symbol Z. We shall learn to find the probability for the Z
distribution first. Thereafter, we will learn to convert a general normal distribution
to a Z distribution.

Example: Find P(Z < 2.00)

Refer to the Standard Normal Table (also known as the Z table) at the end of this
book. There are two pages: the first page is for negative Z values while the
second page is for positive Z values.

The values indicated inside the tables refer to the left areas for the respective Z
values. For example, if we want to find P(Z < 2.00), we can look for Z = 2.00 in
the table and note the value provided (in this case it is 0.9772). The value 0.9772
is the left area from Z = 2.00 under the standard normal curve. This area is as
shown above.

Therefore, P(Z < 2.00) = 0.9772
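
The table lookup can be cross-checked in software. This is a sketch only, using the standard-library NormalDist class, whose cdf method returns the left area under the curve; the value -1.00 in the last line is an arbitrary illustration and not part of the example. Right areas and areas between two values follow because the total area under the curve is 1.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)   # standard normal distribution

left = Z.cdf(2.00)                      # P(Z < 2.00), the value the Z table lists
right = 1 - Z.cdf(2.00)                 # P(Z > 2.00), the remaining area
between = Z.cdf(2.00) - Z.cdf(-1.00)    # P(-1.00 < Z < 2.00)

print(round(left, 4))      # 0.9772
print(round(right, 4))     # 0.0228
print(round(between, 4))   # 0.8186
```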

It is important for you to be familiar with the Z table. We will be using this table
very frequently in subsequent topics. The following information may be useful for
the next exercise:

• P(X > a) = 1 - P(X ≤ a)

• P(a < X < b) = P(X < b) - P(X ≤ a)

Compute Probability of Standard Normal Distribution
LMS Learning Outcome 5.1
https://elearn-diploma.kaplan.com.sg


Find

a) P(Z < 1.96)
b) P(Z < 0.12)
c) P(Z < -1.53)
d) P(Z > 1.96)
e) P(Z > -1.53)
f) P(0.12 < Z < 1.96)

Probability of Z Distribution
https://www.youtube.com/watch?v=zZWd56VlN7w


We have learned how to determine the probability for a Z distribution. You’ll recall
that Z distribution is a normal distribution with mean equal to 0 and standard
deviation equal to 1.

For any normal distribution with mean equal to µ and standard deviation equal to
σ, we use the following formula to scale it to a standard normal
distribution. Once we have the equivalent Z value, we can find the probability
from the Z table exactly as before.

$$Z = \frac{X - \mu}{\sigma}$$

Example:

Let X represent the time it takes to download an image file from the internet.

Suppose X is normally distributed with mean 8.0 and standard deviation 5.0. Find
P(X < 8.6).

Note: μ = 8, σ = 5

$$Z = \frac{X - \mu}{\sigma} = \frac{8.6 - 8}{5} = 0.12$$

P(X < 8.6)
= P(Z < 0.12)
= 0.5478
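
The conversion to Z can be traced step by step in code. This is a minimal sketch using the standard-library NormalDist class; the variable names are our own. It standardises the value first and then, as a cross-check, asks the N(8, 5²) distribution for the same left area directly.

```python
from statistics import NormalDist

mu, sigma = 8.0, 5.0   # mean and standard deviation from the example
x = 8.6

z = (x - mu) / sigma                      # scale X to the Z axis
p_via_z = NormalDist().cdf(z)             # left area under the Z curve
p_direct = NormalDist(mu, sigma).cdf(x)   # same area without converting by hand

print(round(z, 2))          # 0.12
print(round(p_via_z, 4))    # 0.5478
print(round(p_direct, 4))   # 0.5478
```

Both routes give the same probability, which is why a single Z table is enough for every normal distribution.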

Find Probability of Normal Distribution
LMS Learning Outcome 5.2
https://elearn-diploma.kaplan.com.sg


Suppose X is normal with mean 8.0 and standard deviation 5.0. Find
a) P(X > 8.3)

b) P(7.6 < X < 8.6)

Probability of Normal Distribution
https://www.youtube.com/watch?v=p_KApjpyBHE


An Intelligence Quotient (IQ) is a score derived from one of several tests designed
to assess intelligence. It is suspected that the IQ scores of Kaplan Diploma
students have a direct link to their exam performance in Statistics.

Kaplan conducted a recent study and noted that the IQ scores of Kaplan
students are normally distributed with mean 102 and standard deviation 15.

a. If a student is randomly selected, what is the probability that his IQ score is

i. More than 110

ii. Between 100 and 115

b. What is the IQ score for the top 10% of the students?


Summary

Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.

• Difference between discrete and continuous probability distributions

• Standard normal distribution

• Probability of normal distribution


REFERENCES

Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.

Gremmell, D. (2016). The Lowdown on the Empirical Rule. Retrieved from
http://www.bizscisolutions.com/blog/the-lowdown-on-the-empirical-rule

Mejlbro, L. (2009). Continuous Distributions. UK: London Business School.

TutorVista.com. (n.d.). Empirical Rule. Retrieved from
https://math.tutorvista.com/statistics/empirical-rule.html

Weiss, N. (2017). Introductory Statistics. England: Pearson Education Ltd.


Topic 6: Sampling Distribution

In the previous topic, we discussed the concept of continuous probability
distributions and learned how to determine probabilities for the normal
distribution.

In this topic, we will discuss the concept of Sampling Distribution and examine
Sampling distributions of Sample Mean and Sample Proportion.

Learning Outcomes

The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:

1. Explain the concept of sampling distribution

2. Compute the probability of sample mean

3. Compute the probability of sample proportion


6.1 Sampling Distribution of Sample Mean

The Sampling Distribution of a statistic (e.g. the sample mean X̄) is the probability
distribution of that statistic, considered as a random variable, when derived from
a random sample of size n (Berenson, 2015). It may be considered as the
probability distribution of the statistic for all possible samples of a given size
from the same population.

Suppose you have five numbers: {1, 2, 3, 4, 5}.

You want to form samples of 3 numbers (e.g. {1, 2, 3}).

How many samples can you form?

Note that {1, 2, 3} and {3, 2, 1} are the same sample.


In the previous activity, we considered a population of five elements and
noted that we can form 10 samples (5C3). For each sample, we computed the
sample mean. This is just an artificial situation, as the population is generally
much larger. Let's extend this concept to a real-life example.
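
The small activity can be enumerated in full. The sketch below uses the standard-library itertools.combinations function, which treats {1, 2, 3} and {3, 2, 1} as the same sample; the variable names are our own.

```python
from itertools import combinations

population = [1, 2, 3, 4, 5]

# Every distinct sample of 3 numbers; the order within a sample does not matter.
samples = list(combinations(population, 3))
sample_means = [sum(s) / len(s) for s in samples]

population_mean = sum(population) / len(population)
mean_of_sample_means = sum(sample_means) / len(sample_means)

print(len(samples))   # 10
for s, m in zip(samples, sample_means):
    print(s, round(m, 2))
print(population_mean, round(mean_of_sample_means, 4))
```

The ten sample means listed here make up the sampling distribution of the sample mean for this tiny population, and their average coincides with the population mean of 3.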

Consider the population of all the individual heights of Kaplan students. Suppose
we repeatedly take samples of a given size (n) from this population and calculate
the average for each sample (i.e. the sample mean 𝑥̅ ). Each sample has its own
average value, and the probability distribution of these averages is called the
Sampling Distribution of the Sample Mean (Sharma, 2014).

We will discover later that the sampling distribution depends on the underlying
distribution of the population, the statistic being considered and the sample size
used.

Sampling Distribution
https://www.youtube.com/watch?v=Zbw-YvELsaM


We have introduced the concept of sampling distribution and used the example
of the sample mean to illustrate the idea. We shall now examine the Sampling
Distribution of Sample Mean in greater detail.

The Population Mean of the Sampling Distribution of Sample Mean is exactly
the same as the population mean of the underlying population. That is, if the
mean of the underlying population is µ then the mean of the population of sample
means (of a given sample size) is also µ.

The population standard deviation of the sampling distribution of sample mean
(commonly known as the Standard Error) is equal to the standard deviation of
the underlying population divided by the square root of the sample size. That is, if
the standard deviation of the underlying population is σ then the standard error is
σ/√n (Lane, n.d.).


We shall now introduce two theories relating to the sampling distribution of
sample means.

Theory 1:

$$X \sim N(\mu, \sigma^2) \;\Rightarrow\; \bar{X} \sim N\!\left(\mu, \left(\frac{\sigma}{\sqrt{n}}\right)^2\right)$$

Theory 1 states that if the underlying population is normally distributed then the
sampling distribution of sample means will also be normally distributed (Weiss,
2017). The mean and standard deviation of the sampling distribution will be µ
and σ/√n as per our previous discussion.

Theory 2 (Central Limit Theorem):

$$X \text{ not normal and } n \ge 30 \;\Rightarrow\; \bar{X} \sim N\!\left(\mu, \left(\frac{\sigma}{\sqrt{n}}\right)^2\right) \text{ (approximately)}$$

Theory 2 states that if the underlying population is NOT normally distributed but
the sample size that we use is large (greater than or equal to 30), then the sampling
distribution of sample means will be approximately normally distributed
(Weisstein, 2018). The mean and standard deviation of the sampling distribution
will be µ and σ/√n as per previous discussion.

Theory 2 is of significant importance. This is known as the Central Limit
Theorem. This theorem essentially states that even if we do not know the
distribution of the underlying population, as long as the sample is large enough
(i.e. at least 30), the sampling distribution of the sample mean will be
approximately normal.
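
A small simulation can make the Central Limit Theorem more concrete. The sketch below rests on assumptions of our own: the population is exponential with mean 2 (a clearly non-normal, right-skewed shape whose standard deviation is also 2), the sample size is 36, and 5,000 samples are drawn. None of these values come from the text.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)   # fixed seed so that reruns give the same numbers

# Hypothetical non-normal population: exponential with mean 2 and sigma 2.
mu, sigma = 2.0, 2.0
n = 36                 # sample size, at least 30
repetitions = 5_000    # number of samples drawn

sample_means = []
for _ in range(repetitions):
    sample = [random.expovariate(1 / mu) for _ in range(n)]
    sample_means.append(sum(sample) / n)

standard_error = sigma / sqrt(n)
# Share of sample means within 1.96 standard errors of mu (about 95% if normal).
share_close = sum(abs(m - mu) <= 1.96 * standard_error for m in sample_means) / repetitions

print(round(mean(sample_means), 3), "compared with mu =", mu)
print(round(stdev(sample_means), 3), "compared with sigma/sqrt(n) =", round(standard_error, 3))
print(round(share_close, 3))
```

Although the individual observations are strongly skewed, the sample means centre on µ with a spread close to σ/√n, and roughly 95% of them fall within 1.96 standard errors of µ, as a normal curve would suggest.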

Central Limit Theorem
https://www.youtube.com/watch?v=rBjft49MAO8


In the last topic, we learnt that for any normal distribution, we need to scale it to
the standard normal distribution before we can determine the probabilities. You’ll
recall that the standard normal distribution (or Z distribution) is a normal
distribution with a mean equal to 0 and standard deviation equal to 1.

From Theories 1 and 2, when the conditions are met, the sampling distribution of
sample mean will be normally distributed with mean equal to µ and standard
deviation equal to σ/√n. Therefore, we could convert the normally distributed
sampling distribution to Z distribution using the following formula:

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$

Example (Theory 1):

X is normally distributed with mean 15 and standard deviation 7. For a sample of
16, find the probability that the sample mean is greater than 18.

i.e. µ = 15, σ = 7, n = 16

$$X \sim N(15, 7^2) \;\Rightarrow\; \bar{X} \sim N\!\left(15, \left(\frac{7}{\sqrt{16}}\right)^2\right)$$

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{18 - 15}{7 / \sqrt{16}} = 1.71$$

$$P(\bar{X} > 18) = P(Z > 1.71) = 1 - 0.9564 = 0.0436$$


The above example illustrates the case when the underlying population is
normally distributed. Here, we need to quote Theory 1 to support the fact that the
sampling distribution of the sample mean is normally distributed.

The underlying population has µ = 15 and σ = 7. We want to find P(X̄ > 18) when
the sample size n = 16. Our first task is to sketch the normal distribution curve
for X̄. Notice that when the axis is X̄, the centre is 15 (which is the value for µ).
We indicate the value X̄ = 18 and shade the required right area.

Our second task is to construct another Z axis directly below the X̄ axis. Recall
that the centre for the Z axis is zero. We now need to convert the value X̄ = 18
to a corresponding Z value. To do this, we use the formula we just
introduced. We note that the corresponding value is Z = 1.71. We indicate this
value directly below X̄ = 18. Therefore, finding P(X̄ > 18) is the same as finding
P(Z > 1.71). In other words,

$$P(\bar{X} > 18) = P(Z > 1.71) = 1 - 0.9564 = 0.0436$$

You may have realised that the approach we used here is very similar to the
previous topic. In fact, these are essentially the same except that the formula we
used for the conversion to the Z axis is different now.

Probability of Sample Mean (Theory 1)
LMS Learning Outcome 6
https://elearn-diploma.kaplan.com.sg


Example (Theory 2):

A population has a mean 8 and standard deviation 3. Suppose a random sample
of size 36 is selected, what is the probability that the sample mean is between
7.8 and 8.3?

i.e. μ = 8, σ = 3, n = 36

$$X \text{ not normal but } n = 36 \ge 30 \;\Rightarrow\; \bar{X} \sim N\!\left(8, \left(\frac{3}{\sqrt{36}}\right)^2\right)$$

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{7.8 - 8}{3 / \sqrt{36}} = -0.4 \qquad Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{8.3 - 8}{3 / \sqrt{36}} = 0.6$$

$$P(7.8 < \bar{X} < 8.3) = P(-0.4 < Z < 0.6) = 0.7257 - 0.3446 = 0.3811$$

The above example illustrates the case when the underlying population is not
normally distributed. Here, we need to quote Theory 2 to support the fact that the
sampling distribution of the sample mean is approximately normally distributed.

You would have realised that the approach we used here is almost the same as
the previous example. The only difference is that we are using Theory 2 instead
of Theory 1 to explain that X̄ is normally distributed.
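
Both worked examples for the sample mean can be reproduced with a few lines of code. This is a sketch under our own assumptions: the helper names table_area and z_of_sample_mean are hypothetical, and the rounding (Z to two decimals, areas to four) is there only to mimic a printed Z table so that the results match the hand calculations.

```python
from math import sqrt
from statistics import NormalDist


def table_area(z):
    """Left area under the standard normal curve, rounded like a printed Z table."""
    return round(NormalDist().cdf(round(z, 2)), 4)


def z_of_sample_mean(x_bar, mu, sigma, n):
    """Z value of a sample mean, using the standard error sigma / sqrt(n)."""
    return (x_bar - mu) / (sigma / sqrt(n))


# Theory 1 example: mu = 15, sigma = 7, n = 16, find P(X-bar > 18)
p_greater = 1 - table_area(z_of_sample_mean(18, 15, 7, 16))

# Theory 2 example: mu = 8, sigma = 3, n = 36, find P(7.8 < X-bar < 8.3)
p_between = (table_area(z_of_sample_mean(8.3, 8, 3, 36))
             - table_area(z_of_sample_mean(7.8, 8, 3, 36)))

print(round(p_greater, 4))   # 0.0436
print(round(p_between, 4))   # 0.3811
```

Without the table-style rounding the answers differ slightly in the third or fourth decimal place, which is normal when comparing software output with table lookups.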


The height of Kaplan students has mean 1.7 m and standard deviation 0.2 m. If a
sample of 40 students is selected, what is the probability that the mean height
is between 1.6 m and 1.75 m?

µ = ______, σ = ______, n = ______


6.2 Sampling Distribution of Sample Proportion

We have discussed the sampling distribution of the sample mean and introduced two
theories that determine when the distribution is normal. We shall now examine
another sample statistic – Sample Proportion.

π = Proportion of the population having some characteristic

p = Proportion of the sample having the same characteristic

$$p = \frac{n(\text{characteristic})}{\text{sample size}}$$

Note: Proportion is between 0 and 1.

The sample proportion (denoted by p) is the fraction of a sample which has a
certain desired characteristic. Similarly, the Population Proportion (denoted by
π) is the fraction of the population which has the same characteristic. Note that
since a proportion is a fraction of the whole sample or population, it is between 0
and 1 (Stine & Foster, 2014).

Example:

Suppose we know that 40% of Kaplan students like fast food. In this case, the
characteristic of interest is “like fast food” and the population proportion is 0.4. If
we take a sample of 200 Kaplan students and note that 90 of them like fast food
then the sample proportion is 0.45 (from 90/200).


p is approximated by a normal distribution if:

nπ ≥ 10 and n(1 – π) ≥ 10

It is easy to determine whether the sampling distribution of the sample proportion
is approximately normal. We only need to verify that nπ ≥ 10 and n(1 – π) ≥ 10
(Rumsey, n.d.). The limit of 10 is just a rough guide. Some texts use other limits;
for consistency, we shall use the limit of 10 in this course.

Once we know that the sample proportion, p, is normally distributed, we can use
the following formula to convert the sampling distribution of p to the Z distribution
just like the way we deal with X̄:

$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}}$$

Sampling Distribution of Sample Proportion
https://www.youtube.com/watch?v=fuGwbG9_W1c


Example:

If the proportion of Kaplan students who like fast food is 0.4, what is the
probability that, for a sample of 200 students being interviewed, the sample
proportion is between 0.40 and 0.45?

Note: n = 200, π = 0.4

nπ = 200 × 0.4 = 80 ≥ 10 and n(1 – π) = 200 × 0.6 = 120 ≥ 10

Therefore, p is normal.

$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} = \frac{0.45 - 0.4}{\sqrt{\dfrac{0.4(1 - 0.4)}{200}}} = 1.44$$

$$P(0.40 < p < 0.45) = P(0 < Z < 1.44) = 0.9251 - 0.5 = 0.4251$$

In this example, we note that π = 0.4, n = 200 and we want to find P(0.4 < p <
0.45). Since we need to find the probability relating to the sample proportion p,
we need to confirm that the sampling distribution is normal. As shown above, we
verified that nπ and n(1 – π) are both at least 10. Therefore, the sample
proportion (p) is approximately normally distributed.

Similar to the case of the sampling distribution of the sample mean, we convert the p
axis values to Z values using the formula. Note that the centre of the p axis is
the value of π, which corresponds to Z = 0. Once we are on the Z axis, we can
determine the desired probability accordingly.
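
The fast food example can be traced in code as follows. This is a sketch only: the names pi_pop and table_area are our own, and the rounding inside table_area mimics a printed Z table so that the result matches the hand calculation.

```python
from math import sqrt
from statistics import NormalDist


def table_area(z):
    """Left area under the standard normal curve, rounded like a printed Z table."""
    return round(NormalDist().cdf(round(z, 2)), 4)


n, pi_pop = 200, 0.4   # sample size and population proportion from the example

# Check the rule of thumb before treating p as normal.
is_normal = n * pi_pop >= 10 and n * (1 - pi_pop) >= 10

standard_error = sqrt(pi_pop * (1 - pi_pop) / n)
z_low = (0.40 - pi_pop) / standard_error    # centre of the p axis, so Z = 0
z_high = (0.45 - pi_pop) / standard_error

probability = table_area(z_high) - table_area(z_low)

print(is_normal)                # True
print(round(z_high, 2))         # 1.44
print(round(probability, 4))    # 0.4251
```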


The marketing department has recently done an analysis to determine the source
of students for its Diploma courses. It is noted that 55% of the students are from
China. If a sample of 100 students is randomly selected, what is the probability
that more than 70% are from China?


Summary

Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.

• Concept of Sampling Distribution

• Sampling Distribution of Sample Mean

• Sampling Distribution of Sample Proportion


REFERENCES

Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.

Lane, DM. (n.d.). Sampling Distribution of the Mean. Retrieved from
http://onlinestatbook.com/2/sampling_distributions/samp_dist_mean.html

Rumsey, DJ. (n.d.). How to find the Sampling Distribution of a Sample Proportion.
Retrieved from
https://www.dummies.com/education/math/statistics/how-to-find-the-sampling-distribution-of-a-sample-proportion

Sharma, JK. (2014). Business Statistics. India: Vikas Publishing House Pvt Ltd.

Stine, R. & Foster, D. (2014). Statistics for Business. USA: Pearson Education
Ltd.

Weiss, N. (2017). Introductory Statistics. England: Pearson Education Ltd.

Weisstein, EW. (2018). Central Limit Theorem. Retrieved from
http://mathworld.wolfram.com/CentralLimitTheorem.html


Topic 7: Confidence Interval

In the last few topics, we have been discussing the concept of probability. In
particular, we have examined the basic concepts of probability, discrete
probability distributions, continuous probability distributions and sampling
distributions.

With these as our foundation, we are now ready to work on some inferential
statistics. This topic will discuss the concept of Estimation. We will be finding the
Confidence Interval for Population Mean.

Learning Outcomes

The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:

1. Describe the meaning of interval estimate

2. Construct the Confidence Interval for Population Mean when σ is known

3. Construct the Confidence Interval for Population Mean when σ is unknown


7.1 Overview of Confidence Interval

We will begin this topic by discussing the concept of confidence interval.

Point Estimation involves the use of sample data to calculate a single value
(known as a statistic) which is to serve as a "best guess" or "best estimate" of an
unknown population parameter (Asadoorian & Kantarelis, 2008).

For example, to estimate the mean age of Kaplan students (parameter µ), we
could use the mean age of our class (say X̄ = 18.6) as the point estimate. In other
words, based on the sample, we guess that the population mean is 18.6.

Interval Estimation is the use of sample data to calculate an interval of possible
values of an unknown population parameter (Weiss, 2017), in contrast to point
estimation, which gives a single number. For example, we could guess that the
mean age of Kaplan students is between 18.3 and 18.9 instead of using a single
value 18.6.

Interval Estimate
https://www.youtube.com/watch?v=tFWsuO9f74o


A Confidence Interval (CI) is a type of interval estimate of a population
parameter and is used to indicate the reliability of an estimate. It is an observed
interval. That is, it is calculated based on observations.

The CIs are different from sample to sample but frequently include the parameter
of interest. How frequently the observed intervals contain the parameter is
determined by the Confidence Level (Stine & Foster, 2014). The common
confidence levels are 90%, 95% and 99%.

Example:

A 95% CI of the population mean (µ) is a range with lower and upper limits
calculated from a sample. This is used as an interval estimate of µ.

As the true µ is unknown, this range describes possible values that µ could take.
As this is an estimate, the actual µ may or may not be within this interval.

Suppose we took infinitely many samples (of the same size n) and calculated the
95% CI for each sample. We would expect 95% of these CIs to contain µ (NCBI,
2016).
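
This long-run interpretation can be illustrated by simulation. The sketch below is built on assumptions of our own: a hypothetical normal population with µ = 50 and σ = 10, samples of size 25, and 10,000 repetitions in place of "infinitely many". It also borrows the interval X̄ ± 1.96 σ/√n, which is developed in Section 7.2.

```python
import random
from math import sqrt

random.seed(7)   # fixed seed so that reruns give the same numbers

# Hypothetical population and sample size (not taken from the text).
mu, sigma, n = 50.0, 10.0, 25
z_crit = 1.96                        # critical value for 95% confidence
margin = z_crit * sigma / sqrt(n)    # half-width of every interval

trials = 10_000
hits = 0
for _ in range(trials):
    x_bar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    # Does this sample's interval contain the true population mean?
    if x_bar - margin <= mu <= x_bar + margin:
        hits += 1

coverage = hits / trials
print(coverage)
```

The printed share should be close to 0.95. Each individual interval either contains µ or it does not; the 95% describes how often the procedure succeeds across many samples.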

Introduction to Confidence Interval
https://www.youtube.com/watch?v=MbXThbTSrVI


Associated with the confidence level is the Significance Level α, where
α = 1 – confidence level. A 95% confidence interval has a significance level of 0.05.

Determine the significance levels for the following:

90% CI, α = __________

99% CI, α = __________

We shall now learn how to construct the Confidence Interval for Population
Mean. To find the CI for µ, we need to consider the following two cases:

• when the population standard deviation is known

• when the population standard deviation is unknown

Recall that the symbol for population standard deviation is σ. Check that you are
familiar with the symbols used for all the population parameters and sample
statistics (see Topic 2). We will be referring to all these symbols frequently from
this topic onward.


7.2 Confidence Interval for μ (σ Known)

To find the CI for µ when σ is Known, we can use the following formula
(Berenson, 2015):

$$\bar{X} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

In order to use this formula, we require the sampling distribution of the
sample mean to be normal. Recall that we have discussed two theories in the
previous topic that could ensure that X̄ is normally distributed.

We are already familiar with the following symbols used in the formula:

• X̄: Sample mean
• σ: Population standard deviation
• n: Sample size
• Z: Represents the standard normal distribution
• α: Significance level

Nevertheless, we have not discussed the meaning of the symbol Zα/2. The
symbol Zα/2 in the formula is known as the Critical Value. This is defined
as the Z value when the right area under the standard normal curve is α/2. For
example, to find the 95% CI, α is equal to 0.05. As such, α/2 = 0.025. The critical
value Z0.025 represents the Z value when the right area is 0.025.


Recall that the Z table only provides values for the left areas of respective Z
values. As such, to find Z0.025 , we firstly note that the left area is 0.975. We will
then look at the Z table for the value representing left area equal to 0.975. This
will give us a Z value of 1.96. Therefore, Z0.025 = 1.96. Alternatively, we can also
look for the Z value when the left area is 0.025. After that, we need to reflect to
the positive side to obtain Z0.025 .
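
The reverse table lookup described above has a direct software counterpart. This is a minimal sketch using the inv_cdf method of the standard-library NormalDist class, which returns the Z value for a given left area; the function name z_critical is our own.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Z value whose right area under the standard normal curve is alpha / 2."""
    alpha = 1 - confidence_level
    left_area = 1 - alpha / 2            # the Z table is indexed by left area
    return NormalDist().inv_cdf(left_area)


print(round(z_critical(0.95), 2))   # 1.96
print(round(z_critical(0.99), 3))   # 2.576
```

The first result matches the Z0.025 = 1.96 found from the table. The second shows that a higher confidence level needs a larger critical value and therefore a wider interval.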

We may not always be able to find the exact value of the desired left area in the Z
table. In such situations, we shall use the closest value as an approximation.

Determine Z0.05. You will discover that the two table values adjacent to 0.05 are
equally close to 0.05. Use the midpoint value in such cases.

The following example illustrates how we could Construct the Confidence
Interval for population mean when the population standard deviation is
known.

Our first task is to identify all the numerical values and assign appropriate
symbols to them. Since σ is known, we can use the above formula with Z as the
critical value. Note that before we use the formula, we need to confirm that the
sampling distribution of X̄ is normal.


Example:

A sample of 11 circuits from a large normal population has a mean resistance of
2.20 ohms. We know from past testing that the population standard deviation is
0.35 ohms.

Determine a 95% confidence interval for the mean resistance of the population.

Firstly, we summarise the information provided in symbols:

n = 11, X̄ = 2.20, σ = 0.35, α = 0.05

Next, we verify that the conditions are met before we could use the formula:

• σ known

• X normal, then X̄ normal

Therefore, 95% confidence interval for µ:

$$\bar{X} \pm Z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} = 2.20 \pm 1.96\left(\frac{0.35}{\sqrt{11}}\right) = 2.20 \pm 0.207$$

1.993 ohms ≤ µ ≤ 2.407 ohms
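
The same interval can be computed in a few lines. This is a sketch, not a prescribed method: the function name ci_mean_sigma_known is our own, and the critical value is rounded to two decimals only so that the result matches the Z table value of 1.96 used above.

```python
from math import sqrt
from statistics import NormalDist


def ci_mean_sigma_known(x_bar, sigma, n, confidence_level):
    """Confidence interval for the population mean when sigma is known."""
    alpha = 1 - confidence_level
    z_crit = round(NormalDist().inv_cdf(1 - alpha / 2), 2)   # table precision
    margin = z_crit * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


lower, upper = ci_mean_sigma_known(x_bar=2.20, sigma=0.35, n=11, confidence_level=0.95)
print(round(lower, 3), round(upper, 3))   # 1.993 2.407
```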

CI for µ (σ known)
LMS Learning Outcome 7.1
https://elearn-diploma.kaplan.com.sg


A sample of 40 circuits from a large population has a mean resistance of 4.23
ohms. We know from past testing that the population standard deviation is 0.21
ohms. Determine a 90% confidence interval for the true mean resistance of the
population.

n = ____, X̄ = ______, σ = ______, α = ______

CI for µ (σ known)
https://www.youtube.com/watch?v=czdwHU27OqA


7.3 Confidence Interval for μ (σ Unknown)

We have examined the case when the population standard deviation is known.
We shall now discuss how we could construct the Confidence Interval when
Population Standard Deviation is Unknown.

To construct the CI for µ when σ is unknown, we can use the following formula (Lind,
Marchal & Wathen, 2018):

$$\bar{X} \pm t_{\alpha/2}\,\frac{S}{\sqrt{n}}$$

where S is the sample standard deviation.

To use the above formula, we need the underlying population to be
normally distributed. You may recall that in the previous case, when σ is known,
we only need the sampling distribution of X̄ to be normal. This is a weaker
condition than the current one.

We need to introduce a new continuous probability distribution known as the t
distribution. The degrees of freedom of the t distribution for this formula are n – 1.


The t-distribution is a family of continuous probability distributions that is very
similar to the Z distribution. The t-distribution curves are symmetrical, bell-shaped
and centred at zero. They resemble the standard normal distribution but have
heavier tails.

As the degrees of freedom increase, the tails of the t-distribution curves get thinner
and the curves become more like the Z curve. Hence, the t values will be
approximately equal to the Z values when the degrees of freedom are large
(Magnusson, n.d.).

The following example illustrates the Meaning of the Critical Value tα/2. Similar
to the critical value Zα/2, the critical value tα/2 represents the t value when the
right area under the t curve from this value is α/2.

Example:

Find the critical value for 90% confidence interval when the sample size is 10.

n = 10 then df = n -1 = 10 – 1 = 9

α = 0.1, α/2 = 0.05

Hence, t0.05 = 1.833


Find t0.025 when the degrees of freedom are 24.

How about t0.025 when the degrees of freedom are 40?

Did you notice that the t values decrease as the degrees of freedom
increase? What do you think would be the value of t0.025 when the degrees of
freedom are 1000?


Now that we are able to find the critical value for t distribution, we can proceed to
determine the Confidence Interval for Population Mean when the population
standard deviation is unknown.

Example:

A random sample of 25 Kaplan students was selected. Their average weight is
62.3 kg and the standard deviation is 5.4 kg. Assuming that the weights of
students are normally distributed, what is the 95% confidence interval for the
mean weight?

Our first task is to identify all the numerical values and assign appropriate
symbols to them:

n = 25, 𝑥̅ = 62.3, s = 5.4, df = 25 - 1 = 24, t0.025 = 2.064

We need to verify that the conditions are met before we can use the formula:

• σ Unknown (S = 5.4)
• Population normal

Therefore, 95% confidence interval for µ:

\[
\bar{X} \pm t_{\alpha/2}\,\frac{S}{\sqrt{n}}
  = 62.3 \pm 2.064\left(\frac{5.4}{\sqrt{25}}\right)
  = 62.3 \pm 2.23
\]

\[
60.07\ \text{kg} \le \mu \le 64.53\ \text{kg}
\]
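The same calculation can be checked with a short Python sketch. The function name is hypothetical, and the critical value 2.064 is taken from the t table as in the example, not computed.

```python
from math import sqrt


def t_confidence_interval(x_bar, s, n, t_crit):
    """Confidence interval for the mean when sigma is unknown.

    t_crit must be looked up in a t table with n - 1 degrees of freedom.
    """
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin


# Figures from the example: n = 25, sample mean 62.3, S = 5.4, t(0.025, 24) = 2.064
lower, upper = t_confidence_interval(x_bar=62.3, s=5.4, n=25, t_crit=2.064)
print(f"{lower:.2f} kg <= mu <= {upper:.2f} kg")
```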

CI for µ (σ Unknown)
LMS Learning Outcome 7.2
https://elearn-diploma.kaplan.com.sg


A random sample of 20 bags of sugar was drawn to check their weight. The
measurements showed that the mean is 1.02 kg and the standard deviation is
0.1 kg. It is assumed that the weight is normally distributed. Form a 99%
confidence interval for μ.

n = ____, X̄ = ______, S = ______, α = ______

CI for µ (σ Unknown)
https://www.youtube.com/watch?annotation_id=annotation_100657&feature=iv&src_vid=_NGYJxrUGgQ&v=bFefxSE5bmo


Summary

Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.

• Interpretation of interval estimate

• Confidence Interval for µ when σ known

• Confidence Interval for µ when σ unknown


REFERENCES

Asadoorian, M., & Kantarelis, D. (2008). Essentials of Inferential Statistics. USA:
University Press of America Inc.

Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.

Lind, D. A., Marchal, W. G., & Wathen, S. A. (2018). Statistical Techniques in
Business and Economics. USA: McGraw-Hill Education.

Magnusson, K. (n.d.). Understanding the t-distribution and its normal
approximation. Retrieved from http://rpsychologist.com/d3/tdist

NCBI. (2016). How do I interpret a confidence interval? Retrieved from
https://www.ncbi.nlm.nih.gov/pubmed/27184382

Stine, R., & Foster, D. (2014). Statistics for Business. USA: Pearson Education
Ltd.

Weiss, N. (2017). Introductory Statistics. UK: Pearson Education Ltd.


Topic 8: Hypothesis Testing on Population Mean

In the last topic, we discussed the inference of population mean by interval
estimation. This was done by constructing confidence intervals.

We shall now move on to another concept in inferential statistics known as
hypothesis testing. In this topic, we will examine Hypothesis Testing on
Population Mean.

We will discuss some basic concepts of hypothesis testing first before we
introduce two very common hypothesis tests. The underlying theories for this
topic are the same as the previous topic. Therefore, make sure that you have
fully understood the previous topic before you begin your study on this one.

Learning Outcomes

The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:

1. Explain the five steps of hypothesis testing

2. Perform Z-Test

3. Perform t-Test


8.1 Basic Concept of Hypothesis Testing

We will begin this topic by discussing a few Concepts in Hypothesis Testing.
Once that is done, we will string them together into a five-step procedure for a
hypothesis test.

A hypothesis test is a method of making decisions on the population parameter
using sample statistics from collected data. A hypothesis is a claim about a
population parameter that we cannot directly verify as true or false
(Lind, Marchal & Wathen, 2018).

Example:

We claim that the mean monthly handphone bill of Kaplan students is $50
(i.e. µ = 50).

Since we do not have the data of all Kaplan students’ handphone bills, we are
not able to determine the real value of the population mean.

Hence, the claim is a hypothesis since we are not able to verify whether it is true
or false.

In a hypothesis test, we need to set up two hypotheses:

• Null Hypothesis, denoted by H0 and

• Alternative Hypothesis, denoted by H1

The alternative hypothesis is like our standby hypothesis. In the event that we
reject the null hypothesis, we will accept the alternative hypothesis.

Note that the hypotheses will be on the population parameters (in our case, it is
on µ). It will never be on the sample statistics.


To Set Up the Hypotheses, we start off with the business claim. If the business
claim contains the ‘=’ sign (i.e. =, ≤ or ≥), we will place it at the null hypothesis.
Otherwise, it will be placed at the alternative hypothesis. The alternative
hypothesis will always be the complement of the null hypothesis. That is, when
we combine the null and alternative hypothesis, we will get the set of all real
numbers (Berenson, 2015).

Using the above guide, we can set up one of the following three pairs of hypotheses.
Note that the value ‘50’ is just an example. The actual value may be another
number depending on the business claim.

H0: µ = 50      H0: µ ≥ 50      H0: µ ≤ 50
H1: µ ≠ 50      H1: µ < 50      H1: µ > 50

Observe that the null hypothesis always contains the ‘=’ sign (i.e. =, ≤ or ≥),
and the alternative hypothesis never contains the ‘=’ sign (i.e. ≠, < or >).
In addition, when we combine the null and alternative hypotheses, we will
obtain the set of all real numbers.

Let’s take a closer look at the alternative hypothesis. The alternative
hypothesis will determine the tail of the test. When

• H1 contains the ‘<’ sign, we say it is a left-tailed test

• H1 contains the ‘>’ sign, we say it is a right-tailed test

• H1 contains the ‘≠’ sign, we say it is a two-tailed test

We will discuss the meaning of these tails later. At the moment, it suffices to be
able to determine the appropriate tail of the test.

Setting Up Hypothesis
https://www.youtube.com/watch?v=R2hxisYFKxM


The following are two examples that illustrate how we could set up the null and
alternative hypotheses.

Example 1:

Claim: The average Statistics exam score of Kaplan students is 64.3.

(i.e. µ = 64.3)

H0: µ = 64.3 (claim)

H1: µ ≠ 64.3

(two-tailed)

In Example 1, the claim is ‘µ = 64.3’. Since this claim contains the ‘=’ sign, we
placed it in H0. Therefore, H1 will be the opposite which is ‘µ ≠ 64.3’. Since the
sign in H1 is ‘≠’, this is a two-tailed test.

Example 2:

Claim: The average Statistics exam score of Kaplan students is more than 64.3.

(i.e. µ > 64.3)

H0: µ ≤ 64.3

H1: µ > 64.3 (claim)

(right-tailed)

In Example 2, the claim is ‘µ > 64.3’. Since this claim does not contain the ‘=’ sign,
we have placed it in H1. Therefore, H0 will be the opposite which is ‘µ ≤ 64.3’.
Since the sign in H1 is ‘>’, this is a right-tailed test.
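The placement rule used in Examples 1 and 2 can be sketched as a small look-up table in Python. The names RULES and set_up_hypotheses are made up for this illustration, and the signs are written in plain keyboard form (<=, >=, != for ≤, ≥, ≠).

```python
# For each sign that may appear in the claim:
# (sign in H0, sign in H1, tail of the test, hypothesis that holds the claim)
RULES = {
    "=":  ("=",  "!=", "two-tailed",   "H0"),
    "<=": ("<=", ">",  "right-tailed", "H0"),
    ">=": (">=", "<",  "left-tailed",  "H0"),
    "!=": ("=",  "!=", "two-tailed",   "H1"),
    ">":  ("<=", ">",  "right-tailed", "H1"),
    "<":  (">=", "<",  "left-tailed",  "H1"),
}


def set_up_hypotheses(claim_sign, value):
    """Place a claim about mu in H0 or H1 and report the tail of the test."""
    h0_sign, h1_sign, tail, claim_in = RULES[claim_sign]
    return {
        "H0": f"mu {h0_sign} {value}",
        "H1": f"mu {h1_sign} {value}",
        "tail": tail,
        "claim_in": claim_in,
    }


# Example 1: claim is mu = 64.3; Example 2: claim is mu > 64.3
print(set_up_hypotheses("=", 64.3))
print(set_up_hypotheses(">", 64.3))
```

A claim containing the ‘=’ sign lands in H0, any other claim lands in H1, and the tail is always read from the sign in H1.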


Set up the appropriate hypotheses for the following statements:

1. Determine whether the average weekly pocket money of Kaplan students is
more than $100.

Claim: _______________________________________________

(i.e. )

H0:

H1:

( -tailed)


2. Is there sufficient evidence to conclude that the average monthly transport
cost for Kaplan students is less than $53.50?

Claim: _______________________________________________

(i.e. )

3. Last year, the average GPA for Kaplan graduates was 3.2. It is believed that the
GPA for this year’s graduates has significantly changed.


Once we have set up the hypotheses, we wish to determine whether the null
hypothesis is true or false. To do so, we will collect a sample from the specified
population.

For example, if we claim that the mean monthly handphone bill of Kaplan
students is $50, we can randomly sample 100 students and calculate the sample
mean. If X̄ = 20, it is difficult to believe that µ = 50. If X̄ = 40, it is easier
to believe that µ = 50. If X̄ = 48, it is very believable that µ = 50.

So, as the value of X̄ gets nearer to the proposed value of µ, we are more
likely to believe that the null hypothesis is true. At exactly what point between 20
and 50 do we begin to believe that the population mean could be 50?
This cut-off point is known as the Critical Value (Asadoorian & Kantarelis, 2008).
The following picture illustrates the idea of critical values:


If we are performing a left-tailed test, the small left area will be shaded and the
critical value is on the left. Vice-versa, for right-tailed test, the shaded area is on
the right and the critical value is on the right. For a two-tailed test, there are two
shaded areas and critical values as shown above.

The shaded portion is known as the Rejection Region. We will discuss this in
more detail after we have introduced the concept of type I & II errors.

Possible Hypothesis Test Outcomes

                            Actual Situation
Decision      H0 True                         H0 False
Accept H0     No Error (probability 1 – α)    Type II Error (probability β)
Reject H0     Type I Error (probability α)    No Error (probability 1 – β)

Recall that in hypothesis testing, we collect a sample to determine whether the
null hypothesis is believable. This is an inference process. Even if we get the
value of the sample mean very close to the proposed value of the population
mean, we can only guess that the null hypothesis is possibly true. We can never
be 100% sure that our conclusion is correct since we are only using a sample to
draw the conclusion.

Therefore, regardless of which decision we make, we run the risk of committing an
error. We define the errors as follows:

• Type I Error: Reject H0 wrongly (H0 is actually true)

• Type II Error: Accept H0 wrongly (H0 is actually false)

The probability of committing type I error is denoted by the symbol α. This is
exactly the same symbol as the one we use in confidence intervals. The
probability of committing type II error is denoted by the symbol β.
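The meaning of α can be illustrated with a small simulation, offered here only as a sketch. It assumes a normal population in which H0 is true (µ = 50, σ = 4.8; both numbers are illustrative choices), repeatedly draws samples, and counts how often a two-tailed Z test (the test introduced in Section 8.2) rejects H0. The function name, sample size, number of trials and seed are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def type_i_error_rate(mu0=50.0, sigma=4.8, n=25, alpha=0.05, trials=4000, seed=1):
    """Share of samples, drawn while H0 is true, that still lead to rejecting H0."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z_stat = (mean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z_stat) > z_crit:
            rejections += 1  # H0 is true here, so this is a Type I error
    return rejections / trials


rate = type_i_error_rate()
print(f"Share of true H0 rejected: {rate:.3f}")
```

The printed share should come out close to the chosen α of 0.05, since that is the long-run proportion of wrong rejections when H0 is true.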


In hypothesis testing, we usually guard against Type II error by concluding ‘do not
reject H0’ instead of ‘accept H0’. However, we cannot avoid the risk of Type I error
since our starting point is to suspect that H0 may not be true (i.e. to reject H0).
Therefore, every hypothesis test carries a risk of Type I error, and we call its
probability α the Level of Significance of the test (Frost, 2017).

The total area of the shaded tail(s) (left, right or two tails) will be α. So, in the
case of a two-tailed test, each tail will be α/2. We can then find the respective
critical values. For example, if the curve is a Z curve, then the critical values for a
two-tailed test are ±Zα/2.
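For the Z curve, the critical values can be obtained from Python's standard library instead of a printed table. The sketch below is illustrative; the function name and the tail labels are assumptions made for this example.

```python
from statistics import NormalDist


def z_critical_values(alpha, tail):
    """Critical value(s) on the Z curve for a left-, right- or two-tailed test."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "left":
        return (z.inv_cdf(alpha),)
    if tail == "right":
        return (z.inv_cdf(1 - alpha),)
    if tail == "two":
        upper = z.inv_cdf(1 - alpha / 2)  # each tail has area alpha / 2
        return (-upper, upper)
    raise ValueError("tail must be 'left', 'right' or 'two'")


for tail in ("left", "right", "two"):
    values = [round(v, 3) for v in z_critical_values(0.05, tail)]
    print(tail, values)
```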

We will learn how to compute a value known as the test statistic later. If this
value falls inside the shaded region, we reject H0. Otherwise, we do not reject H0.

In summary, the following are the only two possible conclusions for hypothesis
testing:

• Reject H0, accept H1

• Do not reject H0, cannot accept H1


We will be introducing two hypothesis tests (Z test and t test) in this topic. These
tests are for the population mean.

Which test to use depends on whether the population standard deviation is known.
Here is the guide:

• If the population standard deviation is known, we will use Z test.

• If the population standard deviation is unknown, we will use t test.

You may observe that this is similar to the case when we construct the
confidence interval for population mean. We will discuss each test in detail later.


Now that we have introduced all the necessary concepts, we can summarize them by
stating the five steps involved in any hypothesis test. Note down these five
steps and keep them in mind when we discuss a few examples of hypothesis tests.
We call this approach the Critical Value Approach since
we make use of the critical value(s) to determine the conclusion of the test.

• Step 1: Hypothesis

• Step 2: α

• Step 3: Test Statistics

• Step 4: Rejection Rule (draw picture)

• Step 5: Conclusion

Overview of Hypothesis Testing
https://www.youtube.com/watch?v=DlwOTOydeyk


8.2 Hypothesis Testing for μ (σ Known)

We shall now introduce the Hypothesis Testing for Population Mean when
the Population Standard Deviation is Known (Weiss, 2017). The test statistic
for this case is as follows:

\[
Z_{STAT} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
\]

To perform the Z test, we require the sampling distribution of the sample mean to
be normally distributed. This can be achieved if either the underlying population is
normally distributed or the sample size is large. Note that this is the same
condition as the case of finding confidence interval for µ when σ is known.


Example:

Test the claim, at the 0.05 significance level, that the mean monthly handphone bill
of Kaplan students is $50. From a sample of 100 students, we obtained a mean of
$52.30. You may assume that σ = $4.80.

We will use the five steps mentioned earlier to perform the hypothesis testing.

Step 1:

We will start with a claim and formulate it into mathematical representation. We
use the same approach as the activity we did earlier to set up the hypotheses H0
and H1.

Claim: the mean monthly handphone bills of Kaplan students is $50 (i.e. µ = 50)

H0: μ = 50 (claim)
H1: μ ≠ 50
(two-tailed)

Step 2:

This step is the easiest. We just need to decide on the level of significance.

α = 0.05.


Step 3:

In this step, we have to decide on an appropriate test statistic. Since we are
testing µ, we need to check whether the population standard deviation (σ) is
known. In this case, since it is known, we would like to use the Z test. However,
before we proceed further, we need to verify that the sampling distribution of X̄
is normal. This is indeed true in our case since the sample size is large.
Finally, we compute the test statistic using the formula mentioned earlier.

• σ is known

• The population is not stated to be normal, but n = 100 ≥ 30, so X̄ is
  approximately normal

Therefore use Z-test.

\[
Z_{STAT} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}
         = \frac{52.3 - 50}{4.8 / \sqrt{100}}
         = 4.79 \approx 4.8
\]

Step 4:

In this step we need to set up a rejection rule. One way to do so is to present it
using the Z curve. Since we are doing a two-tailed test, the area in each tail is
0.025 (α/2). Look up the Z table and you can determine that the critical values
are -1.96 and 1.96. Next, we indicate the computed test statistic on the
graph. Since the value 4.8 is more than 1.96, it falls inside the
rejection region.


Step 5:
This is the final step in which we draw a conclusion. Since the value of the test
statistic is in the rejection region, we reject H0 and accept H1. We now
need to explain the test conclusion from the perspective of our business case.
Recall that our claim is in H0. Therefore, rejecting H0 implies that we reject our
claim. Hence, we have added the last statement as shown below to conclude the
test.

We reject H0, accept H1.

Therefore, we reject the claim that the mean monthly handphone bill of Kaplan
students is $50.
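The five steps of this example can be retraced in a short Python sketch. It is an illustration only: the function name is made up, and the critical value is computed from the standard normal distribution rather than read from the Z table.

```python
from math import sqrt
from statistics import NormalDist


def z_test_two_tailed(x_bar, mu0, sigma, n, alpha):
    """Two-tailed Z test for a population mean when sigma is known."""
    z_stat = (x_bar - mu0) / (sigma / sqrt(n))      # Step 3: test statistic
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)    # Step 4: critical value
    reject_h0 = abs(z_stat) > z_crit                # Step 4: rejection rule
    return z_stat, z_crit, reject_h0


# Step 1: H0: mu = 50 (claim), H1: mu != 50.  Step 2: alpha = 0.05.
z_stat, z_crit, reject_h0 = z_test_two_tailed(
    x_bar=52.3, mu0=50, sigma=4.8, n=100, alpha=0.05
)

# Step 5: conclusion
print(f"Z_STAT = {z_stat:.2f}, critical values = +/-{z_crit:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```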

Z - test
LMS Learning Outcome 8.1
https://elearn-diploma.kaplan.com.sg


Test the claim, at the 0.01 significance level, that the population mean weight of a
bar of chocolate is 250g. You may assume that the weight is normally distributed with
a population standard deviation of 30g.

A sample of 20 bars of chocolate has shown that the average weight is 240g.


8.3 Hypothesis Testing for μ (σ Unknown)

We shall now introduce the Hypothesis Testing for Population Mean when
the Population Standard Deviation is Unknown (Stine & Foster, 2014). The test
statistic for this case is as follows:

\[
t_{STAT} = \frac{\bar{X} - \mu}{S / \sqrt{n}}
\]

To perform the t test, we require the underlying population to be normally
distributed. This condition is more demanding than the case when σ is known.
For the case when σ is known, we only require the sampling distribution of the
sample mean to be normally distributed. In addition, when determining the critical
value in the t table, we need to note that the degrees of freedom are n – 1.
Note that these are similar to the case of finding confidence interval for µ when σ
is unknown.


Example:

It is believed that Kaplan students’ performance in the Statistics module has
improved: this year’s average exam score is higher than last year’s average of 64.3.
To verify this, the results of 25 students were examined and their average score
was 64.8 with a standard deviation of 5.2. Test whether the claim is true at the 0.10
significance level. You may assume that the scores are normally distributed.

Step 1:

We state the claim and formulate it into mathematical representation. We then
set up the hypotheses H0 and H1 as shown below.

Claim: Kaplan students’ performance in Statistics module has improved

(i.e. µ > 64.3)

H0: μ ≤ 64.3

H1: μ > 64.3 (claim)

(right-tailed)

Step 2: We state the level of significance.

α = 0.10


Step 3:

In this step, we have to decide on an appropriate test statistic. Since we are testing
µ, we need to check whether the population standard deviation (σ) is known. In
this case, since it is unknown, we would like to use the t test. However, before we
can proceed further, we need to verify that the underlying population is normally
distributed. This is indeed true since the question has indicated that the scores
are normally distributed.

• σ is unknown (S = 5.2)

• X is normal

Therefore use t-test.

\[
t_{STAT} = \frac{\bar{X} - \mu}{S / \sqrt{n}}
         = \frac{64.8 - 64.3}{5.2 / \sqrt{25}}
         = 0.48 \approx 0.5
\]

Step 4:

In this step, we will present the rejection rule using the t curve. Since we are
doing a right-tailed test, the area of the right tail is 0.10. Look up the t table and
we can determine that the critical value is 1.318. Remember that for the t test,
the degrees of freedom are n – 1 (i.e. 25 – 1 = 24). Next, we indicate the computed
test statistic on the graph. Since the value 0.5 is less than 1.318, it falls
outside the rejection region.


Step 5:

Since the value of the test statistic is not in the rejection region, we do not reject
H0 and cannot accept H1. Remember that we need to explain the test
conclusion from the perspective of our business case.

Since the claim is in H1 and we cannot accept H1, it implies that we cannot
accept the claim. Hence, we have added the last statement as shown below to
conclude the test.

We do not reject H0, cannot accept H1.

Therefore, we cannot accept that Kaplan students’ performance in the
Statistics module has improved.
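The steps of this t-test example can be retraced with the sketch below. The function name is hypothetical, and the critical value 1.318 is the t-table value quoted in Step 4 (df = 24, right-tail area 0.10); it is passed in rather than computed because Python's standard library has no t distribution.

```python
from math import sqrt


def t_test_right_tailed(x_bar, mu0, s, n, t_crit):
    """Right-tailed t test for a population mean when sigma is unknown.

    t_crit must come from a t table with n - 1 degrees of freedom.
    """
    t_stat = (x_bar - mu0) / (s / sqrt(n))   # Step 3: test statistic
    reject_h0 = t_stat > t_crit              # Step 4: rejection rule
    return t_stat, reject_h0


# Step 1: H0: mu <= 64.3, H1: mu > 64.3 (claim).  Step 2: alpha = 0.10.
t_stat, reject_h0 = t_test_right_tailed(
    x_bar=64.8, mu0=64.3, s=5.2, n=25, t_crit=1.318
)

# Step 5: conclusion
print(f"t_STAT = {t_stat:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```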

t - test
LMS Learning Outcome 8.2
https://elearn-diploma.kaplan.com.sg


SMRT claims that the mean waiting time for a train during the morning peak
hour is less than 5 minutes.

A sample of 20 students was interviewed and their mean waiting time was 5.2
minutes with a sample standard deviation of 0.8 minutes.

Test the appropriate hypotheses at α = 0.01.

You may assume that the waiting time for a train is normally distributed.


Summary

Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.

• Hypothesis testing for µ when σ known

• Hypothesis testing for µ when σ unknown


REFERENCES

Asadoorian, M., & Kantarelis, D. (2008). Essentials of Inferential Statistics. USA:
University Press of America Inc.

Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.

Frost, J. (2016). Significance level. Retrieved from
http://statisticsbyjim.com/glossary/significance-level/

Lind, D. A., Marchal, W. G., & Wathen, S. A. (2018). Statistical Techniques in
Business and Economics. USA: McGraw-Hill Education.

Stine, R., & Foster, D. (2014). Statistics for Business. USA: Pearson Education
Ltd.

Weiss, N. (2017). Introductory Statistics. UK: Pearson Education Ltd.


Topic 9: Hypothesis Testing on Difference of Two Population Means

In the previous topic, we discussed two hypothesis tests on the population
mean. In those cases, we considered only one population. At times, we may
want to compare the means of two populations. To do so, we
need to examine Hypothesis Testing on Difference of Two Population
Means.

Learning Outcomes

The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:

1. Perform Paired t-Test

2. Perform Independent Z-Test

3. Perform Pooled t-Test

4. Perform Non-pooled t-Test


9.1 Setting Hypothesis for Two Population Means

In order to compare the means of two populations, we will examine the difference
between the means (Berenson, 2015). Suppose that we have a population with
mean µ1 and another population with mean µ2. We will examine the difference µ1
- µ2. In that case, we could set up one of the following three pairs of hypotheses:

H0: µ1 - µ2 = 0      H0: µ1 - µ2 ≥ 0      H0: µ1 - µ2 ≤ 0
H1: µ1 - µ2 ≠ 0      H1: µ1 - µ2 < 0      H1: µ1 - µ2 > 0

Example:

Xtrem Slimming Pte Ltd has introduced a new slimming pill which it claims to be
effective in reducing body weight. Ten participants tried the new pill for one
month and their weights were measured. Do you think the new slimming product
is effective?

Claim: The new slimming product is effective.

(i.e. µafter < µbefore so µafter - µbefore < 0)

H0: µafter - µbefore ≥ 0

H1: µafter - µbefore < 0 (claim)

(left-tailed)


Set up the appropriate hypotheses for the following statements:

1. Determine if there is sufficient evidence to conclude that the mean age of
married men is more than the mean age of married women.

Claim: _______________________________________________

2. Some people suspect that the mean IQ score of Kaplan students is different from
that of SIM students.

Claim:


Testing of Difference Between Two Population Means

1. Paired t-Test

2. Independent Z-Test: σ1 & σ2 known

3. Pooled t-Test: σ1 & σ2 unknown but equal

4. Nonpooled t-Test: σ1 & σ2 unknown and not equal

There are four hypothesis tests for the difference between two population means.
To determine which test to use, we will consider the following:

1. Are the data related? Can the data be paired? If yes, we will use Paired t-Test
2. If the data are independent, then we need to check if the population
   standard deviations are known:
   a. If the population standard deviations are known, we use
      Independent Z-Test
   b. If the population standard deviations are unknown, we need to
      estimate whether these are equal (we will discuss how later):
      i.  If the population standard deviations are unknown but equal,
          we use Pooled t-Test
      ii. If the population standard deviations are unknown but not
          equal, we use Nonpooled t-Test
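The decision guide above can be written as a small function. This is only a sketch: the function name and its yes/no arguments are made up for illustration, and the question of whether the standard deviations are equal is passed in as an answer, since the way to estimate it is discussed later in the topic.

```python
def choose_two_mean_test(paired, sigmas_known=False, sigmas_equal=False):
    """Pick one of the four tests for the difference of two population means."""
    if paired:
        return "Paired t-Test"
    # From here on the two samples are independent.
    if sigmas_known:
        return "Independent Z-Test"
    if sigmas_equal:
        return "Pooled t-Test"
    return "Nonpooled t-Test"


print(choose_two_mean_test(paired=True))
print(choose_two_mean_test(paired=False, sigmas_known=True))
print(choose_two_mean_test(paired=False, sigmas_equal=True))
print(choose_two_mean_test(paired=False))
```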

We will discuss each of these tests in the following pages. Do a quick recall of
the five steps in hypothesis testing as we will be using it in all these tests.


9.2 Paired t-Test

The first test we are discussing is the Paired t-Test (Asadoorian & Kantarelis,
2008). This is used when the data could be paired. That is, for the experiments
that we do, we could pair the measurements for each experiment by a common
attribute.

Example:

This example explains the concept of pairing. Suppose we want to
compare the ages of husbands and wives. We will have the
first population consisting of all husbands. We will have a second population
consisting of all wives. The best way to examine the differences in age between
husbands and wives is to have a common attribute (i.e. same couple) when
taking measurements. The age of the husband of a couple will represent
the sample data from the first population. The age of the wife of the same couple
will represent the sample data from the second population.

Difference of age between husbands and wives:

Couple    Husband    Wife    Difference, d
1         59         53       6
2         21         22      -1
3         33         36      -3
4         78         74       4
5         70         64       6
6         33         35      -2
7         68         67       1
8         32         28       4
9         54         41      13
10        52         44       8
Total                        36


Let's take a look at another example. Consider the situation when we want to
determine whether a brand of slimming pills is effective on adults. We will have
the first population consisting of all adults who have not taken the pills. We will
have a second population consisting of all adults who have taken the pills. The
best way to examine the effectiveness of the pills is to observe how much weight
a person has lost after taking the pills. To this end, we can have a common
attribute (i.e. same person) when taking measurements. The weight of a
person before taking the pills will represent the sample data from the first
population. The weight of the same person after taking the pills will represent the
sample data from the second population.

Once we have paired the measurements as shown above, we compute the
differences (denoted by d). Now, we have a single sample consisting of the
differences (d) and we can perform the t-test with the test statistic shown below.
The conditions required are that the data are paired and that the differences are
normally distributed. The degrees of freedom for the t distribution are n – 1.

\[
t_{STAT} = \frac{\bar{d}}{S_d / \sqrt{n}}
\]


Example:

A random sample of 10 married couples gave the following data on ages, in years. At
the 10% significance level, do the data provide sufficient evidence to conclude that
the mean age of married men is more than the mean age of married women?

Couple    Husband    Wife    Difference, d
1         59         53       6
2         21         22      -1
3         33         36      -3
4         78         74       4
5         70         64       6
6         33         35      -2
7         68         67       1
8         32         28       4
9         54         41      13
10        52         44       8
Total                        36


This example explains how we could carry out a paired t-test.

Step 1:

As usual, we start off by stating the claim and then formulate it into mathematical
symbols. We would need to adjust the symbols so that the parameters are on the
left side as shown below. The principles used in setting the hypotheses are
exactly the same as those used in the previous topic.

Claim: The mean age of married men is more than the mean age of married
women

(i.e. µM > µW or µM - µW > 0)

H0: µM - µW ≤ 0

H1: µM - µW > 0 (claim)

(right-tailed)

Step 2: α = 0.10

Step 3:

We have to check that the data are paired and that the population of differences is
normally distributed. We can draw some inference on the normality of the
differences by looking at the histogram. Once we are satisfied with the
assumption, we can compute the test statistic.

• Paired samples (same couple)

• Assume normal differences

Therefore use paired t-test.

\[
t_{STAT} = \frac{\bar{d}}{S_d / \sqrt{n}}
         = \frac{3.6}{4.97 / \sqrt{10}}
         = 2.29
\]

Step 4:

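Next, we will draw the t curve to determine the rejection region. For this
right-tailed test with α = 0.10 and df = 10 – 1 = 9, the t table gives a critical
value of 1.383. Remember to indicate the computed value of the test statistic on
the graph. Since 2.29 is more than 1.383, it falls inside the rejection region.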


Step 5:

The final step is the conclusion. Remember to comment on your test result from
the business perspective.

We reject H0 and accept H1.

Therefore, we accept that the mean age of married men is more than the mean
age of married women.
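The paired t-test on the couples' ages can be reproduced with Python's standard library, as sketched below. The variable names are made up for this illustration, and the critical value 1.383 is an assumption taken from a standard t table (df = 9, right-tail area 0.10) rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Ages of the 10 couples from the table in this example
husband = [59, 21, 33, 78, 70, 33, 68, 32, 54, 52]
wife = [53, 22, 36, 74, 64, 35, 67, 28, 41, 44]

# Pairing turns the two samples into one sample of differences
d = [h - w for h, w in zip(husband, wife)]

d_bar = mean(d)    # mean of the differences
s_d = stdev(d)     # sample standard deviation of the differences
n = len(d)

t_stat = d_bar / (s_d / sqrt(n))

# Assumed table value: t(0.10) with df = n - 1 = 9, right-tailed test
t_crit = 1.383
reject_h0 = t_stat > t_crit

print(f"d_bar = {d_bar:.1f}, S_d = {s_d:.2f}, t_STAT = {t_stat:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```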

You might have noticed by now that the above procedure is similar to the t test
we did in the last topic. In fact, the paired t-test is exactly the same t-test
we did in the previous topic, except that we have to pair the data and
convert the two-sample data into one-sample data by taking differences.

Paired t - test
https://www.youtube.com/watch?v=vB1OmEY5Rcw


Xtrem Slimming Pte Ltd has introduced a new slimming pill which it claims to be
effective in reducing body weight. Ten participants tried the new pill for one
month and their weights were measured. At the 5% significance level, do you think
the new slimming product is effective?

Person Before After


1 59 53
2 82 76
3 60 63
4 78 74
5 70 64
6 58 60
7 68 67
8 80 83
9 71 69
10 58 54


9.3 Independent Z-Test

The second test we are discussing is the Independent Z-Test (Weiss, 2017).
This is used when the two samples are independent and the population standard
deviations are known.

For example, consider the case where we would like to compare the IQ scores of
Kaplan students and SIM students. We could draw a sample of Kaplan students
and a sample of SIM students. There is no way for us to find a meaningful
common attribute to pair the samples. They are obviously independent.

If we also know the population standard deviations of the IQ scores for Kaplan and
SIM students, we could use the following test statistic:

\[
Z_{STAT} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}
\]

Note that we also require both populations to be normally distributed. The
Independent Z-Test is not commonly used since in most situations we do not know the
values of the population standard deviations. Nevertheless, it is good to run
through an example of this test as it provides a good foundation for discussion of
the remaining two tests.


Example:

Test, at the 0.05 significance level, the claim that the mean IQ score of Kaplan
students is different from that of SIM students. A sample of 20 students from Kaplan
shows an average IQ score of 105.8 while a sample of 28 students from SIM has
an average of 106.3. Assume that the IQ scores for Kaplan and SIM students
are normally distributed with standard deviations 5.3 and 5.8 respectively.

Step 1:

Claim: The mean IQ scores of Kaplan students is different from SIM students

(i.e. µK ≠ µS so µK - µS ≠ 0)

H0: µK - µS = 0

H1: µK - µS ≠ 0 (claim)

(two-tailed)

Step 2: α = 0.05

Step 3:

• Independent samples

• σ1 and σ2 are known

• Both populations are normal

Therefore use independent Z-test.

\[
Z_{STAT} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}
         = \frac{105.8 - 106.3}{\sqrt{\dfrac{5.3^2}{20} + \dfrac{5.8^2}{28}}}
         = -0.31
\]

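Step 4:

Since this is a two-tailed test with α = 0.05, the area in each tail is 0.025 and the
critical values from the Z table are -1.96 and 1.96. The test statistic -0.31 lies
between them, so it falls outside the rejection region.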

Step 5:

We do not reject H0, cannot accept H1.

Therefore, we cannot accept that the mean IQ score of Kaplan students is
different from that of SIM students.
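This independent Z-test can be retraced with the sketch below. The function name is hypothetical, and the critical value is computed from the standard normal distribution rather than read from the Z table.

```python
from math import sqrt
from statistics import NormalDist


def independent_z_test(x1, sigma1, n1, x2, sigma2, n2, alpha):
    """Two-tailed Z test for mu1 - mu2 = 0 with known population sigmas."""
    standard_error = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    z_stat = (x1 - x2) / standard_error
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_stat, z_crit, abs(z_stat) > z_crit


# Kaplan: mean 105.8, sigma 5.3, n = 20.  SIM: mean 106.3, sigma 5.8, n = 28.
z_stat, z_crit, reject_h0 = independent_z_test(
    105.8, 5.3, 20, 106.3, 5.8, 28, alpha=0.05
)

print(f"Z_STAT = {z_stat:.2f}, critical values = +/-{z_crit:.2f}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```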

Independent Z - test
https://www.youtube.com/watch?v=EwyxQ-yLSbU


9.4 Pooled and Non-Pooled t-Test

We have discussed the Independent Z-test. This test is for the case when the
population standard deviations are known. In the event that the population
standard deviations are unknown, we need to use a t-Test. For the t-Test,
we will need to decide whether we want to pool the variances. The following are
the guidelines:

• If the population standard deviations are unknown but equal, use a
  Pooled t-Test

• If the population standard deviations are unknown but not equal, use a
  Nonpooled t-Test

As a rough guide, we will treat the population standard deviations as equal if
the ratio of the sample standard deviations is less than 2. That is, the bigger
of the two sample standard deviations is less than twice the smaller one:

SBig/SSmall < 2
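The rule of thumb can be expressed as a small helper. This is only a sketch: the function name `choose_t_test` is hypothetical, and the sample standard deviations passed in are the ones used in the two worked examples later in this section.

```python
def choose_t_test(s1, s2):
    """Apply the rule of thumb: pool when the larger sample standard
    deviation is less than twice the smaller one."""
    ratio = max(s1, s2) / min(s1, s2)
    return "pooled" if ratio < 2 else "non-pooled"


print(choose_t_test(5.3, 5.8))  # ratio is about 1.1
print(choose_t_test(7.3, 2.8))  # ratio is about 2.6
```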


9.4.1 Pooled t-Test

The third test we are discussing is the Pooled t-Test (Lind, Marchal & Wathen,
2018). This is used when the two samples are independent and the population
standard deviations are unknown but equal. We will also require both populations
to be normally distributed.

Since both population standard deviations are equal, we combine the sample
variances to determine the pooled sample standard deviation (denoted by Sp). We
then substitute this value into the test statistic. Note that the degrees of
freedom for the t distribution are n1 + n2 – 2.
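For reference, the standard textbook forms of the pooled standard deviation and the pooled test statistic, which agree with the worked example that follows, can be written as below. The symbols S1 and S2 denote the two sample standard deviations.

```latex
S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
df = n_1 + n_2 - 2
```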


Example:

Test, at the 0.05 significance level, the claim that the mean IQ score of Kaplan
students is different from that of SIM students. A sample of 10 students from
Kaplan shows an average IQ score of 105.8 with a standard deviation of 5.3.
Another sample of 18 students was selected from SIM and the average IQ score was
106.3 with a standard deviation of 5.8. Assume that the IQ scores for Kaplan
and SIM students are normally distributed.

Step 1:

Claim: The mean IQ score of Kaplan students is different from that of SIM students

(i.e. µK ≠ µS so µK - µS ≠ 0)

H0: µK - µS = 0

H1: µK - µS ≠ 0 (claim)

(two-tailed)

Step 2: α = 0.05

Step 3:

• Independent samples

• σ1 and σ2 are unknown but equal (S2/S1 = 5.8/5.3 = 1.1 < 2)

• Both populations are normal

Therefore use the pooled t-test.


$$S_p = \sqrt{\frac{(10 - 1) \times 5.3^2 + (18 - 1) \times 5.8^2}{10 + 18 - 2}} = 5.63$$

$$t = \frac{105.8 - 106.3}{5.63\sqrt{\dfrac{1}{10} + \dfrac{1}{18}}} = -0.23$$

Step 4:

Step 5:

We do not reject H0, cannot accept H1.

Therefore, there is insufficient evidence to support the claim that the mean IQ
score of Kaplan students is different from that of SIM students.
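The pooled calculation can be reproduced with the following Python sketch. The function name `pooled_t_test` is illustrative only. To stay within the standard library, the critical value is not computed; the constant 2.056 is an assumption taken from a standard t-table for a two-tailed test with α = 0.05 and 26 degrees of freedom.

```python
from math import sqrt


def pooled_t_test(mean1, mean2, s1, s2, n1, n2):
    """Return (pooled standard deviation, t statistic, degrees of freedom)."""
    df = n1 + n2 - 2
    # Weighted average of the two sample variances, then square root
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    t = (mean1 - mean2) / (sp * sqrt(1 / n1 + 1 / n2))
    return sp, t, df


# Figures taken from the example above
sp, t, df = pooled_t_test(105.8, 106.3, 5.3, 5.8, 10, 18)

# Assumption: two-tailed critical value for alpha = 0.05, df = 26,
# read from a standard t-table rather than computed here.
T_CRIT = 2.056

print(f"Sp = {sp:.2f}, t = {t:.2f}, df = {df}")
print("Reject H0" if abs(t) > T_CRIT else "Do not reject H0")
```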

Pooled t-test
https://www.youtube.com/watch?v=1F1Y3SuNz9c


9.4.2 Non-Pooled t-Test

The fourth test we are discussing is the Non-pooled t-Test (Stine & Foster,
2014). This is used when the two samples are independent and the population
standard deviations are unknown but NOT equal. We will also require both
populations to be normally distributed.

You may have observed that the test statistic for the Nonpooled t-Test is very
similar to that of the Independent Z-Test, except that we have replaced the
population variances with sample variances. The degrees of freedom for the t
distribution are given by a rather complicated formula.
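For reference, the standard forms of the non-pooled test statistic and its degrees of freedom, which agree with the worked example that follows, can be written as below. The degrees of freedom are rounded down to the nearest whole number.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}},
\qquad
df = \frac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}
          {\dfrac{\left(S_1^2 / n_1\right)^2}{n_1 - 1} + \dfrac{\left(S_2^2 / n_2\right)^2}{n_2 - 1}}
```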


Example:

Test, at the 0.05 significance level, the claim that the mean IQ score of Kaplan
students is different from that of SIM students. A sample of 10 students from
Kaplan shows an average IQ score of 105.8 with a standard deviation of 7.3.
Another sample of 18 students was selected from SIM and the average IQ score was
106.3 with a standard deviation of 2.8. Assume that the IQ scores for Kaplan
and SIM students are normally distributed.

Step 1:

Claim: The mean IQ score of Kaplan students is different from that of SIM students

(i.e. µK ≠ µS so µK - µS ≠ 0)

H0: µK - µS = 0

H1: µK - µS ≠ 0 (claim)

(two-tailed)

Step 2: α = 0.05

Step 3:

• Independent samples

• σ1 and σ2 are unknown and not equal (S1/S2 = 7.3/2.8 = 2.6 > 2)

• Both populations are normal

Therefore use the Non-pooled t-test.


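$$t = \frac{105.8 - 106.3}{\sqrt{\dfrac{7.3^2}{10} + \dfrac{2.8^2}{18}}} = -0.21$$

$$df = \frac{\left(\dfrac{7.3^2}{10} + \dfrac{2.8^2}{18}\right)^2}{\dfrac{\left(7.3^2 / 10\right)^2}{10 - 1} + \dfrac{\left(2.8^2 / 18\right)^2}{18 - 1}} = 10.49 \approx 10 \text{ (always round down)}$$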

Step 4:

Step 5:

We do not reject H0, cannot accept H1.

Therefore, there is insufficient evidence to support the claim that the mean IQ
score of Kaplan students is different from that of SIM students.
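The non-pooled calculation, including the degrees of freedom, can be sketched in Python as follows. The function name `non_pooled_t_test` is illustrative only, and the critical value 2.228 is an assumption taken from a standard t-table for a two-tailed test with α = 0.05 and 10 degrees of freedom.

```python
from math import floor, sqrt


def non_pooled_t_test(mean1, mean2, s1, s2, n1, n2):
    """Return (t statistic, exact degrees of freedom, rounded-down df)."""
    v1 = s1**2 / n1  # variance of the first sample mean
    v2 = s2**2 / n2  # variance of the second sample mean
    t = (mean1 - mean2) / sqrt(v1 + v2)
    df_exact = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df_exact, floor(df_exact)


# Figures taken from the example above
t, df_exact, df = non_pooled_t_test(105.8, 106.3, 7.3, 2.8, 10, 18)

# Assumption: two-tailed critical value for alpha = 0.05, df = 10,
# read from a standard t-table rather than computed here.
T_CRIT = 2.228

print(f"t = {t:.2f}, df = {df_exact:.2f} -> {df}")
print("Reject H0" if abs(t) > T_CRIT else "Do not reject H0")
```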


The Statistics exam scores of a group of Kaplan students were recorded and
grouped by gender as follows:

Gender                Male    Female
Sample Size           10      12
Mean                  68.3    64.2
Standard Deviation    4.8     5.6

Conduct an appropriate hypothesis test, at α = 0.1, to determine whether there is
a difference in Statistics performance between male and female students.
You may assume that the scores for both genders are normally distributed.


Summary

Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.

• Paired t-Test

• Independent Z-Test: σ1 & σ2 known

• Pooled t-Test: σ1 & σ2 unknown but equal

• Nonpooled t-Test: σ1 & σ2 unknown and not equal


REFERENCES

Asadoorian, M. & Kantarelis, D. (2008). Essentials of Inferential Statistics. USA:
University Press of America Inc.

Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.

Lind, D. A., Marchal, W. G., & Wathen, S. A. (2018). Statistical Techniques in
Business and Economics. USA: McGraw-Hill Education.

Stine, R. & Foster, D. (2014). Statistics for Business. USA: Pearson Education
Ltd.

Weiss, N. (2017). Introductory Statistics. UK: Pearson Education Ltd.


Topic 10: Simple Linear Regression

So far we have been discussing the analysis of statistics for one variable. In
this topic, we shall examine the relationship between two variables. In
particular, we are interested in the linear (or straight line) relationship
between two variables.

This discussion can be generalised to multiple variables in more advanced
topics in Statistics, where the analysis is usually done with statistical
software. At this introductory level, we will discuss Regression Analysis and
Correlation Analysis for two variables.

Learning Outcomes

The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:

1. Construct a scatter plot

2. Determine the simple linear regression equation

3. Interpret the slope of the regression line

4. Compute and interpret correlation coefficient


10.1 Equation of Straight Line

We shall begin this topic with a quick review of the equation of a straight line
(Mathcentre, 2009).

The general equation of a straight line is:

y = mx + c

Equation of Straight Line
https://www.youtube.com/watch?v=Jq0fCfpRtV4

Perform an online search for “Equation of Straight Line”.

• What is the meaning of m and c?

• Determine the equation of the following line:


10.2 Regression Analysis

In a regression model, variables can be classified as dependent variables and
independent variables. An Independent Variable (denoted as X) is also known
as an input variable. The Dependent Variable (denoted as Y) represents the
output or effect possibly caused by the independent variables. That is, Y
depends on X. For example, the dependent variable could be the Statistics
Exam Scores of students and the independent variable could be the Time spent
in preparing for the exam. So, Score depends on Time.

Regression analysis is a statistical technique for estimating the relationships
among variables. It helps us to understand how the dependent variable changes
when any one of the independent variables is varied, while the other independent
variables are held fixed. When we are dealing with only one independent variable
and examining the linear relationship, the topic of discussion will be classified as
Simple Linear Regression Analysis.

Before we conduct a regression analysis, we usually draw a Scatter Plot (Chartio,
2018).

Example:

Volume/day (L)    Cost/day (S$)
23                125
26                140
29                146
33                160
38                167
42                170
50                188
55                195
60                200

[Figure: scatter plot "Cost per Day vs. Production Volume"; horizontal axis: Volume per Day, vertical axis: Cost per Day]


In this example, the independent variable (X) is volume and should be plotted
on the horizontal axis. The dependent variable (Y) is cost and should be plotted
on the vertical axis. We mark each pair of data as a dot on the diagram but do
not connect the points.

A scatter plot is a useful tool for identifying potential linear (straight line)
associations between the two variables. Occasionally, the scatter plot may not
show a potential linear relationship. In such situations, we may have to do a
change of variable, such as plotting Y against x², ln x, etc.

Scatter Plot
https://www.youtube.com/watch?v=NcgRa0uotXs

A Simple Linear Regression Model shows the straight-line relation between the
dependent variable (Y) and the independent variable (X) with some random
fluctuation around the line (Berenson, Levine & Szabat, 2015).


Suppose we are able to plot all the possible pairs of (X, Y) on a scatter plot.

This becomes a scatter plot for the population. If we are able to fit a straight
line through these dots, we can use this line to explain the linear regression
model. Note that Y = β0 + β1X is the equation of a straight line. The slope of
the line is β1 and the y-intercept is β0.

Note that not all the points fall on the straight line. If you look carefully at
the linear regression model Yi = β0 + β1Xi + εi, it is actually the equation of
the straight line plus a term εi (ε is pronounced "epsilon"). This term
represents a small random error.

Therefore, in a simple linear regression model, for a given value Xi of the
independent variable, the observed value Yi may be slightly higher or lower than
the value obtained by substituting Xi into the equation (i.e. β0 + β1Xi).
Nevertheless, one assumption of the model is that εi ~ N(0, σ²), where σ is some
unknown but constant value. This means that the mean of εi is zero and hence the
average of Yi will be Y = β0 + β1Xi.

Therefore, we will regard the value of Y as the average value of all Yi for a
particular Xi.


In actual practice, we only have a sample of pairs of (X, Y) data. Based on the
sample, we fit these points to an Estimated Regression Line Ŷ = b0 + b1X.

Firstly, note that we use b0 and b1 instead of β0 and β1. This is because we are
using sample data to estimate the regression line. So, b0 and b1 are estimates of
β0 and β1 respectively. The symbol Ŷ (pronounced "Y-hat") represents the
estimated average value of Y.

To obtain the estimated regression line, we need to use the Least Squares
Method. This method essentially considers all possible straight lines and, for
each straight line, determines the sum of squared differences between the
observed Y and the estimated Ŷ. The most suitable line for estimating the actual
line is the one with the smallest sum of squared differences.

The Least Squares Method

$$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum \bigl(Y_i - (b_0 + b_1 X_i)\bigr)^2$$

Introduction to Simple Linear Regression
https://www.youtube.com/watch?v=KsVBBJRb9TE


We have briefly described the principle of the least squares method. This
method yields formulas that are used to determine the values of b0 and b1. The
actual derivation of these formulas requires calculus and is outside this
syllabus. It suffices to be able to apply these formulas. We will illustrate
this by an example.
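For reference, the standard computational forms of the least squares formulas, which agree with the quantities Sxy and Sxx used in the worked example that follows, can be written as below. Here n is the number of data pairs.

```latex
S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n},
\qquad
S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n},
\qquad
S_{yy} = \sum y^2 - \frac{\left(\sum y\right)^2}{n}

b_1 = \frac{S_{xy}}{S_{xx}},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```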

Example:

KFC needs to order oil daily for its fried chicken. In a particular KFC store, the
volume of oil used and the respective cost for nine days were recorded as
shown. Use the data provided to determine the regression equation.

Volume per day (L)    Cost per day (S$)
23                    125
26                    140
29                    146
33                    160
38                    167
42                    170
50                    188
55                    195
60                    200


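$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{2662.7}{1386.2} = 1.9, \qquad \bar{x} = \frac{356}{9} = 39.6, \qquad \bar{y} = \frac{1491}{9} = 165.7$$

$$b_0 = \bar{y} - b_1 \bar{x} = 165.7 - 1.9 \times 39.6 = 90.5$$

Therefore, the regression equation is:

$$\hat{Y} = b_0 + b_1 X$$

$$\hat{Y} = 90.5 + 1.9X$$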
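The quantities in this example can be recomputed from the raw data with a short Python sketch. The variable names are illustrative, and the computational forms Sxy = Σxy − (Σx)(Σy)/n and Sxx = Σx² − (Σx)²/n are the standard ones, assumed here to match the guide's definitions.

```python
# Data from the KFC example above
volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]         # X: litres per day
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]  # Y: S$ per day

n = len(volume)
x_bar = sum(volume) / n
y_bar = sum(cost) / n

# Sums of cross-products and squares about the means
s_xy = sum(x * y for x, y in zip(volume, cost)) - sum(volume) * sum(cost) / n
s_xx = sum(x * x for x in volume) - sum(volume) ** 2 / n

b1 = s_xy / s_xx           # slope
b0 = y_bar - b1 * x_bar    # intercept, using unrounded b1 and means

print(f"Sxy = {s_xy:.1f}, Sxx = {s_xx:.1f}")
print(f"b1 = {b1:.2f}, b0 = {b0:.1f}")
```

Note that carrying full precision gives an intercept of roughly 89.7 rather than 90.5. The difference arises because the worked example rounds b1 and the two means before computing b0.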

Determine Regression Equation
https://www.youtube.com/watch?v=CtKeHnfK5uA


It is believed that students' performance in the Statistics exam is closely
related to the amount of time they studied for the exam. To determine the
relationship, the Statistics exam scores of 10 students were recorded together
with the respective amount of time (in hours) they spent preparing for the exam
(see the data below).

a) Use the data provided to determine the regression equation.

b) Predict the average score of students who have studied for 25 hours.


Student    Time    Score    X²    XY    Y²
1          10      32
2          13      35
3          15      40
4          19      49
5          22      53
6          25      58
7          31      64
8          35      70
9          39      74
10         43      81
TOTAL


We have learnt how to determine the values of b0 and b1 and write down the
linear regression equation. We shall now learn to interpret these values.

Firstly, b0 is the y-intercept of the straight line. As such, b0 is the
estimated average value of Y when the value of X is zero.

Secondly, b1 is the slope of the straight line. Therefore, b1 is the change in
the estimated average value of Y when X increases by one unit.

Example:

Continuing from the previous KFC example:

$$\hat{Y} = b_0 + b_1 X$$

$$\hat{Y} = 90.5 + 1.9X$$

Interpret the slope of the regression line.

The slope is 1.9. This means that the estimated mean cost increases by S$1.90
for every additional litre of oil used.
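The interpretation of the slope can be illustrated numerically. In this sketch the function name `predicted_cost` and the volumes of 40 and 41 litres are hypothetical choices; the coefficients 90.5 and 1.9 are the rounded values from the regression equation above.

```python
def predicted_cost(volume_litres, b0=90.5, b1=1.9):
    """Estimated mean daily cost (S$) for a given daily oil volume (L)."""
    return b0 + b1 * volume_litres


# Increasing the volume by one litre changes the prediction by the slope
increase = predicted_cost(41) - predicted_cost(40)
print(f"Estimated cost at 40 L: S${predicted_cost(40):.2f}")
print(f"Estimated cost at 41 L: S${predicted_cost(41):.2f}")
print(f"Increase per extra litre: S${increase:.2f}")
```

The same difference is obtained for any pair of volumes one litre apart, which is what it means for the slope to be constant along a straight line.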


Continuing from the previous Statistics Exam Scores activity, interpret the
slope of the regression line that you obtained.


10.3 Correlation Analysis

We have learned to obtain the estimated linear regression equation in the
previous section. However, how much can we rely on the equation when using it
to estimate the dependent variable? To answer this question, we need to conduct
a Correlation Analysis (Weiss, 2017).

Correlation analysis measures the relationship between the dependent and
independent variables. For example, we may want to know how strong the
relationship is between exam score and the amount of time spent preparing for it.

To achieve this, we need to introduce a measurement known as the Sample
Correlation Coefficient (denoted by r). The formula is as follows:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

The correlation coefficient lies between -1 and 1 inclusive:

$$-1 \le r \le 1$$

The correlation coefficient measures the strength of the straight-line relation
between X and Y. The closer r is to zero, the weaker the straight-line trend.
Conversely, if r is close to -1 or 1, the linear relationship is strong.


Note that the correlation coefficient only measures the strength of the linear
relationship. The two variables may have a strong non-linear relationship but
the coefficient of correlation could be near to zero.
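A small Python sketch can illustrate this point. The data here are hypothetical and chosen purely for illustration: Y is exactly X squared, so the two variables are perfectly related, yet the relationship is not a straight line. The function name `correlation` is also an illustrative choice.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_yy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data: a perfect but non-linear (quadratic) relationship
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x**2 for x in xs]

r = correlation(xs, ys)
print(f"r = {r:.2f}")
```

Because the X values are symmetric about zero, the positive and negative cross-products cancel and r comes out at zero, even though Y is completely determined by X.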

The following examples illustrate that when the sample data are close to the
straight line, the value of r is nearer to -1 or 1. These are examples of strong
linear correlation.


When the data are scattered far apart from the straight line, the value of r will be
nearer to zero which indicates a weak linear relationship.

Correlation Coefficient
https://www.youtube.com/watch?v=ugd4k3dC_8Y


Example:

Using the same KFC example discussed earlier, determine and interpret the
correlation coefficient.

Therefore, the correlation coefficient is:

$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{2662.7}{\sqrt{1386.2 \times 5290}} = 0.98$$

This means that there is a strong positive linear relationship between the cost
and the volume of oil used.
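The value of r, including the quantity Syy = 5290 used above, can be recomputed from the raw KFC data with the following sketch. The variable names are illustrative, and the computational forms of Sxy, Sxx and Syy are the standard ones, assumed here to match the guide's definitions.

```python
from math import sqrt

# Data from the KFC example
volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]         # X: litres per day
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]  # Y: S$ per day

n = len(volume)
s_xy = sum(x * y for x, y in zip(volume, cost)) - sum(volume) * sum(cost) / n
s_xx = sum(x * x for x in volume) - sum(volume) ** 2 / n
s_yy = sum(y * y for y in cost) - sum(cost) ** 2 / n

r = s_xy / sqrt(s_xx * s_yy)
print(f"Syy = {s_yy:.0f}, r = {r:.2f}")
```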


Continuing from the previous Class Activity on Statistics Exam Scores, determine
and interpret the correlation coefficient.

Student    Time    Score
1          10      32
2          13      35
3          15      40
4          19      49
5          22      53
6          25      58
7          31      64
8          35      70
9          39      74
10         43      81
TOTAL


Summary

Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.

• Scatter Plot

• Linear Regression Equation

• Interpretation of Slope

• Correlation Coefficient

• Interpretation of Correlation Coefficient


REFERENCES

Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.

Chartio. (2018). What is a Scatter Plot and When to Use It. Retrieved from
https://chartio.com/learn/dashboards-and-charts/what-is-a-scatter-plot/

Mathcentre. (2009). Equations of Straight Lines. Retrieved from
http://www.mathcentre.ac.uk/resources/uploaded/mc-ty-strtlines-2009-1.pdf

Weiss, N. (2017). Introductory Statistics. UK: Pearson Education Ltd.
