BSDDDM Study Guide v2.0-2
KHE-LCD-SGD-00039 ii
BUSINESS STATISTICS AND DATA-DRIVEN DECISION MAKING
Table of Contents
Message to Student
Kaplan Desired Graduate Attributes
Table of Contents
About this module
Instructions to Students
Scheme of Work
Assessment Matters
Topic 1: Present Data
Topic 2: Numerical Measures
Topic 3: Probability
Topic 4: Discrete Probability Distributions
Topic 5: Normal Distributions
Topic 6: Sampling Distribution
Topic 7: Confidence Interval
Topic 8: Hypothesis Testing on Population Mean
Topic 9: Hypothesis Testing on Difference of Two Population Means
Topic 10: Simple Linear Regression
Other sources:
Instructions to Students
PowerPoint Slides
Scheme of Work
SESSION
TOPICS
FT
1 Topic 01 Present Data
• Basic Concepts
• Data Collection
• Present Qualitative Data
• Present Numerical Data
13 Recap topics 6-10
Module Consolidation
Scheme of Work
SESSION
TOPICS
PT
1 Topic 01 Present Data
• Basic Concepts
• Data Collection
• Present Qualitative Data
• Present Numerical Data
6 Topic 09 Hypothesis Testing on Difference of Two Population Means
• Paired t-Test
• Independent Z-test
• Pooled t-Test
• Non-pooled t-Test
Exam Briefing
Module Consolidation
Assessment Matters
TOP TIP:
The surest way to succeed is to ensure all work is
correctly referenced. Keep a copy of the Kaplan
Singapore Academic Works and APA
Guide handy when you are typing your
assignments and use it to guide you as to
correct referencing, citation and other aspects of
academic writing.
Topic 1: Introduction to Accounting
Topic 2: Double Entry Book-keeping and Trial Balance
Topic 3: Final Day Adjustments
Topic 4: Preparation of Basic Financial Statements
Topic 5: Ratios Analysis
Topic 6: Introduction to Management Accounting
Topic 7: Budgeting
Topic 8: Standard Costing and Variances
Topic 9: Cost Volume Profit Analysis and Decision Making
Topic 10: Capital Budgeting
As we go through this module, you will find plenty of occasions in life and
business which would require statistics. The more you delve into the world of
statistics, the more exciting the journey will get!
Learning Outcomes
The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:
There are two broad branches of Statistics – Descriptive Statistics and Inferential
Statistics (Weiss, 2017).
Descriptive statistics provides simple summaries about the data collected and
about the preliminary observations. Such summaries may be either quantitative
(numerical measures) or visual (e.g. simple-to-understand graphs). These
summaries may either form the basis of the initial description of the data as part
of a more extensive statistical analysis, or they may be sufficient in and of
themselves for a particular investigation.
Population vs Sample
LMS Learning Outcome 1.1
https://elearn-diploma.kaplan.com.sg
Go through these questions and tally your answers with the lecturer.
3. Kaka Ltd has 9,000 workers. 360 staff members were polled regarding a new
wage package to be submitted to management. The population is the 360
members.
True False
In statistics, we begin with the list of Variables we would like to analyze. The
answers that we collected are known as Data. A variable is a characteristic being
observed that may assume more than one of a set of values.
Example of variable: Preferred colour of mobile phones
Example of data: black, white, red, blue, yellow
Variable/Data
• Qualitative (Categorical)
• Quantitative (Numerical)
   - Discrete
   - Continuous
Variables can be further divided into Qualitative (or Categorical) and
Quantitative (or Numerical) types (Australian Bureau of Statistics, n.d.).
Quantitative variables can be further divided into Discrete and Continuous types
(Stephanie, 2018).
Types of Variables/Data
LMS Learning Outcome 1.2
https://elearn-diploma.kaplan.com.sg
Discuss with your classmates the types of variables for the following. A
good way to work on this is to imagine what kind of data you will be collecting for
each of these variables.
Indicate
4. Age of students
6. Unit price
7. GDP Growth
9. Revenue
In this section, we will discuss the issues related to data collection. In particular,
we will discuss the ‘Sources of Data’ and ‘Sampling Methods’.
Secondary data refers to data that was collected by someone other than
the researcher. Secondary sources offer interpretation or analysis based
on primary sources. They may explain primary sources and often use
them to support a specific argument or persuade the reader to accept a
certain point of view.
The following are examples showing how you could collect primary data:
• Experiment
For a batch of 50,000 newly manufactured mobile phones,
the Factory Manager can collect a random sample of 100 phones,
fully charge each phone and determine the average stand-by time
• Survey
To determine whether Kaplan students are happy with the campus
facilities, we can conduct a survey on 200 students
• Observation
To decide whether the duration of 2 hours for Statistics exam is
sufficient, the lecturer can observe the time taken by a group of
students during the next Statistics exam
For example, if Kaplan wants to conduct a survey involving 100 of its students, it
would have to ensure that the sample of 100 students represents all Kaplan
students. If Kaplan surveys only local students, the sample will be biased,
since it does not represent the actual distribution of Kaplan students by
country.
Note that we would need to serialize the population for this process to work.
Sampling Methods
https://www.youtube.com/watch?v=pTuj57uXWlk
In this section, we shall discuss the various methods to present data in tables
and charts.
You will discover that it is very important to differentiate the types of data so that
we could present the information using appropriate tools. Therefore, before you
begin this section, please do a quick review on the difference between qualitative
and quantitative data.
We will discuss the tools that are commonly used to present qualitative data
(Deborah, 2016). Recall that qualitative data are also known as categorical data.
Here are some examples of qualitative variables:
Your Country
Types of Diploma
Favourite Colour
Summary Table
For qualitative data, we will create a Summary Table in which the first column
indicates the name of the variable (e.g. Race in Singapore for the above table).
We then create another column indicating the percentage for each category.
We could also create another column showing the frequency for each category, if
necessary. Frequency is defined as the number of occurrences of that category.
We can similarly work out the percentages of the rest of the categories and
display them as in the table above.
Bar Chart
A Bar Chart is a common tool used to present qualitative data. A bar chart or bar
graph is a chart with rectangular bars with lengths proportional to the values that
they represent. The bars can be plotted vertically or horizontally. A vertical bar
chart is sometimes called a column bar chart.
[Figure: horizontal bar chart "Race in Singapore" with the categories Chinese, Malays, Indians and Others]
One axis of the chart shows the specific categories being compared, and the
other axis represents the percentage or frequency. In this example, you will
notice that we have used the vertical axis to show the races (Chinese, Malays,
Indians & Others). On the horizontal axis, we have shown the percentage.
Pie Chart
A Pie Chart is another commonly used tool to present qualitative data. Pie charts
can be an effective way of displaying information in some cases, in particular if
the intent is to compare the size of a slice with the whole pie, rather than
comparing the slices with one another.
[Figure: pie chart "Race in Singapore": Chinese 74%, Malays 14%, Indians 9%, Others 3%]
A pie chart is a circular chart divided into sectors, illustrating proportion. In a pie
chart, the size of each sector (and consequently its angle), is proportional to the
quantity it represents. The angle of a whole circle is 360° and it represents 100%.
As such, an angle of 3.6° represents 1% of the whole. Therefore, to
represent the 74% of Chinese, we would need an angle of 74 × 3.6 = 266.4°.
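The angle rule above can be checked with a short Python sketch. It uses the percentages from the "Race in Singapore" pie chart; the function name `sector_angle` is only an illustrative choice.

```python
# Percentages taken from the "Race in Singapore" pie chart.
shares = {"Chinese": 74, "Malays": 14, "Indians": 9, "Others": 3}


def sector_angle(percent):
    """Return the angle in degrees of a sector covering `percent` of the pie."""
    # A full circle (360 degrees) represents 100%, so 1% is 3.6 degrees.
    return percent * 360 / 100


angles = {race: sector_angle(pct) for race, pct in shares.items()}

for race, angle in angles.items():
    print(f"{race}: {angle:.1f} degrees")
```

The four angles should add up to 360°, which is a quick way to check the arithmetic.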
We have seen how we could present qualitative data in Summary Tables, Bar
Charts and Pie Charts. We shall now discuss how we could present
quantitative data.
Recall that quantitative data is numerical data. The following are some examples
of quantitative data:
Number of Modules in this Term
Number of Students in the QA Classes
School Fee for Diploma Courses
24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
In this example, we noted that the data ranges from 12 to 58. As such, we could
use five class intervals as shown:
Class Interval    Frequency    Percentage (%)
10 to 20              3             15
20 to 30              6             30
30 to 40              5             25
40 to 50              4             20
50 to 60              2             10
TOTAL                20            100
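As a rough check of the table, the counting can be sketched in Python. The sketch assumes that each class interval includes its lower limit and excludes its upper limit (so a value of 30 falls in "30 to 40"); the guide does not state this convention explicitly, and the function name is illustrative.

```python
# The 20 observations from the example.
data = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
        32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def frequency_table(values, lower_limits, width):
    """Count the values in each class [low, low + width) and compute percentages."""
    rows = []
    for low in lower_limits:
        high = low + width
        # Assumption: lower limit included, upper limit excluded.
        freq = sum(1 for v in values if low <= v < high)
        pct = 100 * freq / len(values)
        rows.append((low, high, freq, pct))
    return rows


table = frequency_table(data, lower_limits=[10, 20, 30, 40, 50], width=10)

for low, high, freq, pct in table:
    print(f"{low} to {high}: frequency {freq}, percentage {pct:.0f}%")
```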
The cumulative frequency of a class interval is the sum of the frequency of that
class and the frequencies of all the classes below it. For example, for the class interval
“30 to 40”, the cumulative frequency is 3 + 6 + 5 = 14. Similarly, we can work out
the cumulative frequencies of other classes as well as the cumulative percentage.
Class Interval    Frequency    Percentage (%)    Cumulative Frequency    Cumulative Percentage (%)
10 to 20              3             15                    3                        15
20 to 30              6             30                    9                        45
30 to 40              5             25                   14                        70
40 to 50              4             20                   18                        90
50 to 60              2             10                   20                       100
TOTAL                20            100
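The running totals in the table can be reproduced with a short Python sketch. It starts from the class frequencies given above; the variable names are illustrative.

```python
from itertools import accumulate

class_labels = ["10 to 20", "20 to 30", "30 to 40", "40 to 50", "50 to 60"]
frequencies = [3, 6, 5, 4, 2]  # class frequencies from the table

total = sum(frequencies)

# Running total of the frequencies: each class plus all classes below it.
cumulative_freq = list(accumulate(frequencies))

# Express each running total as a percentage of all observations.
cumulative_pct = [100 * c / total for c in cumulative_freq]

for label, cf, cp in zip(class_labels, cumulative_freq, cumulative_pct):
    print(f"{label}: cumulative frequency {cf}, cumulative percentage {cp:.0f}%")
```

The last class should always reach the total number of observations and 100%.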
Histogram
You may note that histograms are very similar to bar charts. The differences are:
• Bar charts are for categorical data while histograms are for numerical data
• Bar charts have gaps between the bars while histograms do not have gaps
• Bar charts can be displayed vertically or horizontally while histograms are
usually displayed vertically
Histogram
https://www.youtube.com/watch?v=YLPDPglvePY
Ogive
The following table shows the daily high temperature (in ºF) of a village in China.
Construct a Histogram and Ogive using the following frequency distribution:
Temperature    Number of Days
10 to 20             1
20 to 30             3
30 to 40             5
40 to 50             4
50 to 60             2
TOTAL               15
Summary
Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.
REFERENCES
Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.
Deborah, J. (2016). Statistics For Dummies. New York, United States: John
Wiley & Sons Inc.
In this topic, we shall discuss the various methods to summarise numerical data
into useful measurements. Firstly, we will introduce the three measurements of
central tendency – Mean, Median and Mode. Thereafter, we will learn how to
measure variation of data using both Standard Deviation and Interquartile
Range. Finally, we will also introduce some important symbols for parameters
and statistics. To do all these computations, you will need a non-programmable
scientific calculator.
Before we begin, recall that a population consists of all the items or individuals
about which you want to draw a conclusion. We usually do not have the
population. As such, we collect a sample which is a subset of the population for
analysis. Therefore, in this topic, we shall emphasize the measurements for a
sample. We will briefly discuss the measurements of the population at the end of
the topic.
Learning Outcomes
The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:
We shall begin with the Measures of Central Tendency for sample data.
Central tendency relates to the way in which quantitative data tend to cluster
around some value. A measure of central tendency is any of a number of ways of
specifying this "central value". In practical statistical analysis, the terms are often
used before we have chosen even a preliminary form of analysis. Hence, an
initial objective might be to choose an appropriate measure of central tendency.
In this section, we shall look at three measures of central tendency – Mean,
Median and Mode (Berenson, 2015).
Sample mean is the most common measure of central tendency and is important
in performing statistical analysis. For example, sample mean could be used to
determine the average score of the QA quiz for our class.
To compute the sample mean, we need to add up all the data and divide by the
total number of observations.
Example: 3, 5, 6, 8, 9
$$\bar{X} = \frac{3 + 5 + 6 + 8 + 9}{5} = \frac{31}{5} = 6.2$$
Sample Mean
https://www.youtube.com/watch?v=lBlnjzHVUYU
Sample Median is the numerical value separating the higher half of a sample
from the lower half. The sample median can be found by arranging all the
observations from lowest value to highest value and picking the middle one.
Example A: 3, 5, 6, 8, 9
Example B: 3, 5, 6, 8, 9000
Comparing Examples A and B, we note that the only difference is that the number 9 in
Example A is replaced by 9000 in Example B. Nevertheless, we observe that
the sample median remains 6. Therefore, unlike the sample mean, the sample
median is not affected by extreme values (Rehill, n.d.).
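The effect of the extreme value can be illustrated with a small Python sketch using the standard `statistics` module and the data from Examples A and B.

```python
from statistics import mean, median

example_a = [3, 5, 6, 8, 9]
example_b = [3, 5, 6, 8, 9000]  # same data, but 9 is replaced by 9000

mean_a, median_a = mean(example_a), median(example_a)
mean_b, median_b = mean(example_b), median(example_b)

# The mean is pulled upwards by the extreme value; the median is not.
print(f"Example A: mean = {mean_a}, median = {median_a}")
print(f"Example B: mean = {mean_b}, median = {median_b}")
```

The mean jumps from 6.2 to 1804.4, while the median stays at 6 in both examples.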
If we have a large sample, it may be more efficient to work out the median
position of the ordered data to locate the middle value. The formula for the
median position is as follows where n is the sample size discussed earlier:
$$\text{Median position of the ordered data} = \frac{n + 1}{2}$$
Example:
0 0 5 7 8 9 12 14 22 33
(Middle two values: 8 and 9)
Referring to the above example, we have a sample of 10 data values (i.e. n = 10). Firstly,
we need to arrange the data from smallest to largest. Thereafter, we use the
formula provided to determine the median position as 5.5.
Since 5.5 is between 5 and 6, we need to take the average of the 5th value (i.e. 8)
and the 6th value (i.e. 9). Finally, we can conclude that the median is 8.5
(average of 8 and 9).
Note that median position is not the median; it is just the location of the median.
As such, when you compute the sample median, do use the presentation
provided above as your example. In addition, always remember to arrange your
data first when you compute the sample median.
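The median-position procedure can be sketched in Python as follows. The function names are illustrative; the data is the 10-value example above.

```python
def median_position(n):
    """Location of the median in the ordered data (not the median itself)."""
    return (n + 1) / 2


def sample_median(values):
    """Find the median by locating the median position in the ordered data."""
    ordered = sorted(values)  # always arrange the data first
    pos = median_position(len(ordered))
    if pos.is_integer():
        # Whole-number position: take that value (positions start at 1).
        return ordered[int(pos) - 1]
    # Position such as 5.5: average the two adjacent values (5th and 6th).
    lower = ordered[int(pos) - 1]
    upper = ordered[int(pos)]
    return (lower + upper) / 2


sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 33]
print("Median position:", median_position(len(sample)))
print("Median:", sample_median(sample))
```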
Sample Median
https://www.youtube.com/watch?v=SMwRMkvxik0
Sample Mode is the value that occurs most frequently in the sample.
Example A:
In Example A, the mode is 9 since this number appears most often.
Example B:
In Example B, since all the numbers appear the same number of times, we
conclude that the sample has no mode.
Example C:
In statistics, the concept of the Shape of the Distribution refers to the shape of
a probability distribution and it most often arises in questions of finding an
appropriate distribution to model the statistical properties of a population when
given a sample (MathBits, 2018).
We will learn the concept of probability distribution in subsequent topics. For now,
it suffices to be able to identify the shape of the distribution of a sample through a
histogram.
As we will discover in later topics, the most important shape of distribution is the
Bell Curve. This is a symmetrical distribution. In this case, the mean is equal to
the median.
When the mean is less than the median, we regard this as a left-skewed
distribution; the shape of the curve is shown in the left-hand picture above.
On the other hand, when the mean is more than the median, we have a
right-skewed distribution, as shown in the right-hand picture above.
Shape of Distribution
https://www.youtube.com/watch?v=A_8cfQJeqjs
Is it left- or right-skewed?
The measures of central tendency give us a feel for the centre of the data.
However, these measurements are not sufficient as we do not have a good feel
as to whether the data are close together or widely spread out.
We shall now take a look at the first measure of variation – Sample Standard
Deviation. To obtain the standard deviation, we need to compute the Sample
Variance then take the square-root of this value (ThoughtCo, n.d.).
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
The formula may seem challenging at first glance but it is quite easy to apply.
Take a look at the following example and you will be more comfortable with it.
Example: 10 12 14 15 17 18 18 25
The standard deviation will have the same unit of measurement as the data. For
example, if the data refers to the heights of students (in metres) then the
standard deviation will also have the same unit – metres.
Step 3: Compute the sample standard deviation by taking the square root of
variance (4.6 for this example)
Do also pay attention to the symbols used for sample variance (S²) and sample
standard deviation (S). These symbols will come in handy when we deal with
later topics.
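The computation can be traced step by step with a short Python sketch. It uses the example data above (10, 12, 14, 15, 17, 18, 18, 25) and the sample variance formula with n − 1 in the denominator; the variable names are illustrative.

```python
from math import sqrt

data = [10, 12, 14, 15, 17, 18, 18, 25]
n = len(data)

# Sample mean.
x_bar = sum(data) / n

# Squared difference between each value and the sample mean.
squared_deviations = [(x - x_bar) ** 2 for x in data]

# Sample variance: sum of squared deviations divided by n - 1.
sample_variance = sum(squared_deviations) / (n - 1)

# Sample standard deviation: square root of the sample variance.
sample_sd = sqrt(sample_variance)

print(f"mean = {x_bar}")
print(f"variance = {sample_variance:.2f}")
print(f"standard deviation = {sample_sd:.1f}")
```

Rounded to one decimal place, the standard deviation agrees with the value of 4.6 quoted in Step 3.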
Standard Deviation
LMS Learning Outcome 2.2
https://elearn-diploma.kaplan.com.sg
Find the
Mean
Variance
Standard Deviation
Standard deviation shows how much variation or dispersion exists from the mean.
Very loosely, we can use standard deviation as an indication of the “average gap”
between the data and the sample mean. A low standard deviation indicates that
the data points tend to be very close to the mean; high standard deviation
indicates that the data points are spread out over a large range of values.
Take a look at the above three examples. All three examples have the same
sample mean (15.5). By looking at the mean, we will have no clue about the
variations of the three samples. If we examine the sample standard deviation, we
will notice that Sample B has the smallest value, which indicates that the data are
close to the sample mean. Conversely, Sample C has a large standard deviation,
which indicates that the data are far away from the sample mean.
Coefficient of Variation
In such cases, we need to derive the Coefficient of Variation from the sample
mean and standard deviation as shown in the following formula:
$$CV = \frac{S}{\bar{X}} \times 100\%$$
The coefficient of variation represents the ratio of the standard deviation to the
mean in percentage. It is a useful statistic for comparing the degree of variation
from one sample to another, even if the means are drastically different from
each other.
In the above examples, both have the same standard deviation ($5) but their
means differ. When we compute the coefficient of variation, we note that Sample
B has a much higher value (50%). Therefore, we conclude that Sample B has a
bigger variation although both have the same sample standard deviation.
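A minimal Python sketch of the comparison follows. Sample B's mean of $10 is implied by its standard deviation of $5 and its coefficient of variation of 50%. Sample A's mean is not given in the text, so the value of $100 used below is purely hypothetical and serves only as a contrast.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100


# Sample B: standard deviation $5; mean $10 is implied by the stated CV of 50%.
cv_b = coefficient_of_variation(sd=5, mean=10)

# Sample A: standard deviation $5; the mean of $100 is a HYPOTHETICAL value,
# chosen only to show how a larger mean lowers the CV.
cv_a = coefficient_of_variation(sd=5, mean=100)

print(f"Sample A (hypothetical mean): CV = {cv_a:.0f}%")
print(f"Sample B: CV = {cv_b:.0f}%")
```

Both samples have the same standard deviation, yet the sample with the smaller mean has the larger relative variation.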
The next measure of variation is the interquartile range. To compute this statistic,
we would need to learn the concept of Quartiles first. The Quartiles of a sample
are the three points (indicated by arrows in the following diagram) that divide
the data set into four equal groups, each representing a fourth of the population
being sampled (Weiss, 2017).
The first quartile (Q1) is the value in which 25% of the observations are smaller
than this value.
The second quartile (Q2) is the value in which 50% of the observations are
smaller than this value. You may also realize that the second quartile is actually
the median.
The third quartile (Q3) is the value in which 75% of the observations are smaller
than this value.
We can use the above formulas to work out the Quartiles Positions. Recall that
n is the symbol for the sample size.
If n = 11, then Q1 position = (11 + 1)/4 = 3. Hence, we take the 3rd value as Q1.
If n = 12, then Q1 position = (12 + 1)/4 = 3.25 ≈ 3. Hence, we take the 3rd value
as Q1.
If n = 14, then Q1 position = (14 + 1)/4 = 3.75 ≈ 4. Hence, we take the 4th value as
Q1.
If n = 13, then Q1 position = (13 + 1)/4 = 3.5. In this case, just like the way we
deal with median, Q1 will be the average of the 3rd and 4th values.
As a general rule, if a quartile position falls exactly halfway between two whole
numbers (e.g. 2.5, 7.5, 8.5), we need to average the two adjacent values to obtain
the respective quartile. Otherwise, we just round the position to the nearest
whole number to locate the respective quartile.
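The rounding rule can be sketched in Python as follows. The sketch assumes the position formulas k(n + 1)/4 for k = 1, 2, 3; the Q1 and Q2 forms appear in the worked examples, while the Q3 form is assumed by the same pattern. The function names are illustrative.

```python
import math


def quartile_position(n, k):
    """Position of the k-th quartile (k = 1, 2 or 3) in ordered data of size n."""
    return k * (n + 1) / 4


def positions_to_use(position):
    """Apply the rule: average the neighbours at x.5, otherwise round to nearest."""
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0.5:
        return (lower, lower + 1)  # average these two values
    if fraction < 0.5:
        return (lower,)            # round down
    return (lower + 1,)            # round up


for n in (11, 12, 14, 13):
    pos = quartile_position(n, 1)
    print(f"n = {n}: Q1 position = {pos}, use value(s) at {positions_to_use(pos)}")
```

The four sample sizes reproduce the four cases worked out above.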
The following example illustrates how we compute the quartile position and the
respective quartiles. Do note that we need to arrange our data before we
proceed with the computation. You may want to give some attention to the
presentation of the workings too. Do not mix up quartile positions (which are the
locations of the quartiles) and the quartiles.
Q2 position = 2(9+1)/4 = 5 so Q2 = 16
The Interquartile Range (IQR) is defined as the difference between Q1 and Q3.
IQR = Q3 – Q1
The IQR will have the same unit as the data. Note that the IQR spans
the middle 50% of the data.
If the IQR is small, we can conclude that the middle 50% of the data are close to
the median. Hence, the variation of the data is small.
If the IQR is very large, we can conclude that the middle 50% of the data are far
apart. In this case, the data at the two ends will be even further apart. Hence, the
variation of the data will be large.
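As an illustration, the IQR can be computed in Python with the quartile position rule from this topic. The sketch reuses the data set from the standard deviation example (10, 12, 14, 15, 17, 18, 18, 25), which is a choice made here rather than an example from the guide, and assumes the position formula k(n + 1)/4.

```python
import math


def quartile(ordered, k):
    """k-th quartile using position k(n + 1)/4 and the guide's rounding rule."""
    position = k * (len(ordered) + 1) / 4
    lower = math.floor(position)
    fraction = position - lower
    if fraction == 0.5:
        # Halfway between two positions: average the two adjacent values.
        return (ordered[lower - 1] + ordered[lower]) / 2
    nearest = lower if fraction < 0.5 else lower + 1
    return ordered[nearest - 1]  # positions start at 1


data = sorted([10, 12, 14, 15, 17, 18, 18, 25])  # arrange the data first

q1 = quartile(data, 1)  # position 2.25, so take the 2nd value
q3 = quartile(data, 3)  # position 6.75, so take the 7th value
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Statistical software often uses a different interpolation method for quartiles, so results from a package may differ slightly from this rule.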
Example:
$44, $49, $55, $39, $28, $39, $38, $56, $59, $64
So far, we have learned three measures of central tendency (mean, median and
mode) and two measures of variation (standard deviation and interquartile range).
where
N = population size
This is just the average of all the values in the population. In usual practice, we
do not have the population and hence there is not much opportunity to apply the
formula to compute the population mean. Nevertheless, it is important to note the
symbol µ (pronounced "mu"), which represents the population mean.
The formulas for the Population Variance are shown above. The computations
are very similar to the sample variance. Like the population mean, it is more
important to be familiar with the symbols σ² (read as sigma-squared)
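The difference between the population and sample versions can be seen in a short Python sketch. It assumes the usual definition in which the population variance divides the sum of squared deviations by N, whereas the sample variance divides by n − 1. The data is borrowed from the sample standard deviation example and is treated as a whole population only for illustration.

```python
from statistics import pvariance, variance

# Data from the earlier sample example, treated here as a full population
# purely for illustration.
data = [10, 12, 14, 15, 17, 18, 18, 25]

population_variance = pvariance(data)  # divides by N
sample_variance = variance(data)       # divides by n - 1

print(f"population variance = {population_variance:.4f}")
print(f"sample variance     = {sample_variance:.4f}")
```

For the same data, the population variance is always the smaller of the two because its denominator is larger.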
The following table shows the key symbols which will be used frequently in the
discussion of subsequent topics:
You are strongly encouraged to commit these symbols to memory. You will
soon find that it is much easier to understand the concepts and interpret
the questions when you are familiar with these symbols.
Summary
Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.
Measurement of Variation
- Standard Deviation, Coefficient of Variation
- Quartiles, IQR
REFERENCES
Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.
Topic 3: Probability
In this topic, we will learn some basic concepts of probability. This will enable you
to be familiar with the meaning of probability and equip you for more challenging
concepts of probability distributions in later topics.
Learning Outcomes
The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:
Probability is derived from the word probably. For example, if you ask your friend
whether he is going to the party tonight, he may answer ‘probably’ or ‘maybe’. If
you ask further what the chances are that he will be there, he may say a 70%
chance. In terms of probability, this is 0.7; just divide the percentage
by 100.
A probability of zero means that the event will definitely not happen. For example,
if we roll a die and we want to obtain a ‘7’, this is impossible and
hence the probability is zero.
A probability of one means the event will surely happen. For example, suppose we roll a
die and we want the number that appears to be less than 7. As we know, a
normal die only has the numbers 1 to 6. Therefore, we are sure the
number appearing will be less than 7 and hence this event will occur with a
probability of 1.
Recall that in Topic 1 we discussed the two types of variables (Qualitative
and Quantitative) as shown above. In this topic, we will introduce a few generic
formulas and probability principles which apply to all situations. Nevertheless, in
our examples, we will apply them specifically to categorical variables. We will
discuss probability for discrete and continuous variables in subsequent topics.
$$P(E) = \frac{n(E)}{\text{Total}}$$
An event is a set of outcomes which are of interest to us. The probability of an
event is the number of outcomes in the event divided by the total number of
outcomes (Pishro-Nik, 2014).
We usually use the notation ‘E’ to represent the event of interest. The upper case
‘P’ stands for probability. For example, P(E) will stand for the probability of the
event E occurring. The notation ‘n(E)’ stands for the number of outcomes in the
event E. The denominator ‘Total’ in the above formula refers to the total number
of outcomes.
Example:
No. of Days
Total 360
$$P(\text{Raining}) = \frac{n(\text{Raining})}{\text{Total}} = \frac{148}{360} = \frac{37}{90}$$
In this example, the event is 'today is a rainy day'. We observe that there were
148 rainy days out of a total of 360 days. Therefore, we can apply the
probability formula to obtain the probability of rain.
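The same calculation can be sketched in Python. The standard `fractions` module keeps the answer as an exact fraction in lowest terms; the function name `probability` is illustrative.

```python
from fractions import Fraction


def probability(n_event, total):
    """P(E) = n(E) / Total, returned as an exact fraction in lowest terms."""
    return Fraction(n_event, total)


# 148 rainy days out of 360 days, as in the example.
p_raining = probability(148, 360)

print(f"P(Raining) = {p_raining} (about {float(p_raining):.3f})")
```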
We shall now introduce the three basic types of events that we generally
observe in the determination of probability (Pennsylvania State University, 2018).
A Complement of an Event is the set of all outcomes that are not included in
the outcomes of the event. The complement of event A is denoted by A’. A simple
way to consider the complement of an event is to negate the event.
For example, the complement of ‘selecting a diamond card’ is not
selecting a diamond card (i.e. selecting a heart, spade or club).
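A small Python sketch of the card example follows. It assumes a standard 52-card deck with 13 diamonds and uses the complement rule P(A’) = 1 − P(A), a standard result that is not stated in the text above.

```python
from fractions import Fraction

# Assumption: a standard deck of 52 cards, 13 of which are diamonds.
p_diamond = Fraction(13, 52)

# Complement rule: the event and its complement together cover all outcomes.
p_not_diamond = 1 - p_diamond

print(f"P(diamond) = {p_diamond}")
print(f"P(not diamond) = {p_not_diamond}")
```

The result matches a direct count of the 39 hearts, spades and clubs out of 52 cards.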
We will introduce the following three common tools that are used to visualise
events:
Contingency Table
Tree Diagram
Venn Diagram
These tools will help you to appreciate the situation by displaying the events in
a pictorial form. With their assistance, we can determine the desired probability
much more easily.
Marital Status
Gender Male 65 25 90
Female 48 72 120
In the above example, we have used the contingency table to display the joint
frequencies of Gender and Marital Status. From the table above, we
observe the following:
Example:
Find the:
b) P(Pass Stats)
a)
b)
Black 3 2 5
White 1 1 2
Blue 2 1 3
Total 6 4 10
Calculate
1. P(new blue shoe)
2. P(old shoe)
In the above example, we displayed at the first level the outcomes of your QA
course - Pass or Fail. These events are represented by two branches starting
from the same common point. We displayed the probability for each outcome on
the respective branch. Note that the probabilities on the branches sum to 1. This
will always be so since the total probability is always 1.
The second level branches out further from the first-level outcomes. If you pass QA,
you could go for a party or a holiday. If you fail QA, you could remodule or quit the
course. Note again that we indicated the probabilities on the branches.
In this way, if we are interested in the event ‘Pass and Party’, we can trace the
desired branch accordingly. Usually, we make use of the tree diagram to display
conditional probabilities. We will discuss this in a later part of this topic.
The third visualizing tool is the Venn Diagram. A Venn diagram is a diagram that
shows all possible logical relations between events (Lucidchart, 2018). In the
above example, Event A is represented by the smaller circle while Event B is
represented by the bigger circle. The event A and B is represented by the
overlapping region and the Event A or B is represented by the total region within
the two circles.
Venn Diagram
https://www.youtube.com/watch?v=b6t0994ZZDA
If P(A) = 0.3, P(B) = 0.5, P(A and B) = 0.1, use Venn Diagram to find
b) P(A or B)
So far, we have learnt only one formula for determining the probabilities
of simple events and joint events. We shall now discuss a second formula,
stated above, which is known as the General Addition Rule (Weiss, 2017).
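The General Addition Rule, P(A or B) = P(A) + P(B) − P(A and B), can be checked with a few lines of Python. This is a minimal sketch: the function name `prob_a_or_b` and the input values are hypothetical and chosen only for illustration.

```python
def prob_a_or_b(p_a, p_b, p_a_and_b):
    """General Addition Rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b


# Hypothetical probabilities, for illustration only
p_union = prob_a_or_b(0.40, 0.35, 0.15)
print(round(p_union, 4))

# When A and B cannot occur together, P(A and B) = 0 and the rule
# reduces to a simple sum.
p_union_exclusive = prob_a_or_b(0.30, 0.20, 0.0)
print(round(p_union_exclusive, 4))
```

The overlap is subtracted because it is counted twice when P(A) and P(B) are added.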
Example:
Black 3 2 5
White 1 1 2
Blue 2 1 3
Total 6 4 10
Previously, we discussed three basic types of events: Simple events, Joint
events and Complementary events. We shall now discuss one more type of event,
known as Mutually Exclusive Events.
Events A and B are said to be mutually exclusive if, when event A occurs,
event B cannot happen and, vice versa, when event B happens, event A cannot
happen (Varadhan, 2001).
P(A and B) = 0
By the definition of mutually exclusive, we can conclude that the joint probability
of two mutually exclusive events is zero. i.e. P(A and B) = 0 when A and B are
mutually exclusive.
Hence, using the General Addition Rule, if events A and B are mutually exclusive,
P(A or B) = P(A) + P(B).
The following table shows the joint distribution for engineers and scientists by
highest degree obtained:
A conditional probability is the probability that an event will occur, when another
event is known to occur or have already occurred (Berenson, 2015). Consider
two events A and B, if we knew that event B has already occurred and we want
to find the probability of event A, we would require the conditional probability of A
given that B has occurred. The symbol we use to represent this is P(A|B). We
read this as ‘probability of A given B’.
However, imagine we know that today is your QA exam. That is, we know that
event B is sure to happen (a given condition). Therefore, the probability of 'You
will come to school' (event A) given that 'Today is your QA exam' (event B) is
going to change. This probability is a conditional probability -- P(A|B).
\[ P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} \]
The Formula for Conditional Probability turns out to be quite easy. It is just the
joint probability of event A and B divided by the probability of event B. As
mentioned earlier, the challenge is not in applying the formula but in recognizing
the situations where conditional probability is required.
Example:
It is noted that among all Kaplan Students, 70% are foreign students, 40% are
male and 20% are both. What is the probability that a student is male, given that
he is a foreign student?
We note that:
\[ P(\text{Male} \mid \text{Foreign}) = \frac{P(\text{Male and Foreign})}{P(\text{Foreign})} = \frac{0.2}{0.7} = \frac{2}{7} \]
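The same calculation can be reproduced in Python. The sketch below uses exact fractions so that the answer stays as 2/7; the helper name `conditional_probability` is our own and is not part of any library.

```python
from fractions import Fraction


def conditional_probability(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) = 0."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than 0")
    return p_a_and_b / p_b


# Figures from the Kaplan student example above
p_male_and_foreign = Fraction(20, 100)
p_foreign = Fraction(70, 100)

p_male_given_foreign = conditional_probability(p_male_and_foreign, p_foreign)
print(p_male_given_foreign, round(float(p_male_given_foreign), 4))
```

Note that P(Male) = 0.4 is not needed: once we know the student is foreign, only the foreign students matter.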
Download Games 38 42 80
Never download Games 70 150 220
Total 108 192 300
If a student who purchased an iPhone is randomly selected, what is the chance
that he will download Games?
In other words, the joint probability of A and B can be written as the product
of the conditional probability P(A|B) and the probability P(B). We refer to the
above expression as the Multiplication Rule (Berenson, 2015).
Let’s revisit the Tree Diagram we learnt earlier. When dealing with conditional
probabilities, the first level of the tree shows the probabilities of simple
events. However, when we extend to the second level, we need to show the
conditional probabilities.
In the above diagram, the first branch breaks out to events B and B’. P(B) and
P(B’) are stated on the branches. Following from each branch (i.e. B and B’), we
extended it to events A and A’. Notice that in the second level, we indicated
respective conditional probabilities such as P(A|B), P(A’|B), etc.
To work out the joint probability of A and B, we just trace the appropriate branch
and multiply the probabilities along the path. i.e. P(A and B) = P(B) x P(A|B).
Similarly, P(B’ and A) = P(B’) x P(A|B’).
You may have realised by now that we are actually using the multiplication rule
by doing the above. Therefore, the tree diagram is actually an easy way to apply
multiplication rule using visualization!
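Tracing every path of a two-level tree can be sketched in Python as follows. The probabilities used here are hypothetical and serve only to illustrate the multiplication along each path.

```python
from fractions import Fraction as F

# Hypothetical first-level probabilities: P(B) and P(B')
first_level = {"B": F(6, 10), "B'": F(4, 10)}

# Hypothetical second-level conditional probabilities: P(A|B) and P(A|B')
p_a_given = {"B": F(3, 10), "B'": F(5, 10)}

# Multiply along each path: P(B and A) = P(B) x P(A|B), and so on
paths = {}
for branch, p_branch in first_level.items():
    paths[(branch, "A")] = p_branch * p_a_given[branch]
    paths[(branch, "A'")] = p_branch * (1 - p_a_given[branch])

for path, prob in paths.items():
    print(path, float(prob))

# The four end-of-path probabilities cover every outcome, so they sum to 1
print(sum(paths.values()))
```

Adding the end-of-path probabilities for every path that ends in A gives P(A), which is a convenient check on the tree.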
On his way to the office, Mr Tan walks past a newspaper store every morning. He
is likely to buy the Straits Times or the Business Times (not both), but on some
days he does not buy any paper. The chance that he will buy the Straits Times is
0.5 and the chance that he will buy the Business Times is 0.3. If he bought the
Straits Times, there is a 70% chance that he will bring it home in the evening.
If he bought the Business Times, the chance is 20%.
Using the tree diagram, find the probability that Mr Tan did not bring any
newspaper home tonight.
Two events are Independent if the occurrence of one event does not change the
probability of the other occurring (Weiss, 2017). In other words, the two events
are not related.
For example, event A represents obtaining a 2 when rolling a die and event B
represents obtaining a Head when flipping a coin. We say that these two events
are independent since rolling a 2 does not affect the probability of flipping a
Head and vice versa.
P(A|B) = P(A).
The expression states that the probability of A happening, given that B has
happened, remains unchanged. That is, the occurrence of B has no effect on the
probability of A, which is exactly the meaning of independence.
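The die-and-coin example can be verified by listing the whole sample space. This is a small sketch that assumes a fair die and a fair coin, so that all 12 outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# All 12 equally likely (die, coin) outcomes, assuming a fair die and coin
outcomes = list(product(range(1, 7), ["H", "T"]))


def prob(event):
    """Probability of an event = favourable outcomes / total outcomes."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))


def is_a(outcome):
    return outcome[0] == 2      # event A: the die shows a 2


def is_b(outcome):
    return outcome[1] == "H"    # event B: the coin shows a Head


p_a = prob(is_a)
p_b = prob(is_b)
p_a_and_b = prob(lambda o: is_a(o) and is_b(o))
p_a_given_b = p_a_and_b / p_b

print(p_a, p_a_given_b)  # the two values match, so A and B are independent
```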
1. A female student has just sat for the Statistics exam, what is the
probability that she will fail?
Summary
Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.
Types of Events
Visualizing Events
Conditional Probability
REFERENCES
Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.
Learning Outcomes
The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:
You may recall that there are two types of variables – Qualitative (categorical)
and Quantitative (numerical). For quantitative variables, we can further divide
them into Discrete and Continuous variables. For this topic, we are dealing with
probabilities of discrete variables.
Firstly, we will discuss the concept of Probability Distribution and how we could
construct this for discrete variables. Thereafter, we will compute the mean and
standard deviation for discrete probability distributions.
Number of Modules Taken    Probability
2                          0.20
3                          0.40
4                          0.24
5                          0.16
The above table shows the Probability Distribution of a discrete variable. Here,
the variable of interest is ‘Number of Modules Taken’. Note that this is a
discrete variable since the possible values are 2, 3, 4 or 5 (separate, countable
values with nothing in between).
The column on the right shows the probability for each possible value. Sum up
the probabilities in the right column and you should get a total probability of 1.
The table presents the distribution of the probabilities for a discrete variable
and is therefore known as a discrete probability distribution.
Next, for each value of X, we compute the respective probability using our basic
probability formula. Make sure that you do a final check by adding up all the
probabilities and it should be equal to one. So, we have managed to present the
discrete probability distribution in a table form.
x Probability
0 0.25
1 0.5
2 0.25
Did you notice that this chart shows exactly the same information presented in
the probability distribution table? We can determine the probability of each
value of X by reading the chart. Hence, the above chart is also a probability
distribution.
Once we have constructed the probability distribution, we can compute the Mean
for a Discrete Probability Distribution, using the following formula. The mean
is also known as the Expected Value (nzmaths, n.d.).
\[ \mu = E(X) = \sum_{i=1}^{N} X_i \, P(X_i) \]
Example:
x P(x)
0 0.25
1 0.5
2 0.25
Note that we have used the symbol µ, the population mean, to represent the mean.
This is because a probability distribution always refers to the population, and
hence its mean is always referred to as the population mean.
We can repeat this experiment many times (say 1 million times) and note down
the value of X for each experiment. Take the average of these values and the
answer is likely to be very close to the mean we found above (i.e. 1.25) since we
have a very large sample. Theoretically, if we repeat the experiment infinitely
many times (i.e. we get the population for X), the average value of X will be
exactly 1.25.
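The idea that a long-run average approaches the expected value can be tried out in Python. The sketch below uses the ‘Number of Modules Taken’ distribution from earlier in this topic; the seed and the number of repetitions are arbitrary choices made for illustration.

```python
import random

# 'Number of Modules Taken' distribution from earlier in this topic
distribution = {2: 0.20, 3: 0.40, 4: 0.24, 5: 0.16}


def expected_value(dist):
    """Mean of a discrete distribution: sum of x * P(x)."""
    return sum(x * p for x, p in dist.items())


mu = expected_value(distribution)

# Repeat the "experiment" many times and average the observed values.
rng = random.Random(42)  # fixed seed so the run is repeatable
draws = rng.choices(
    list(distribution), weights=list(distribution.values()), k=100_000
)
sample_mean = sum(draws) / len(draws)

print(round(mu, 2), round(sample_mean, 2))
```

With 100,000 repetitions the simulated average should land very close to the expected value, and it moves closer still as the number of repetitions grows.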
Once we have determined the mean, we can use the following formulas to
compute the Variance and Standard Deviation of Discrete Probability
Distributions (Morganstern, 2013). Recall from Topic 2 that, whenever we want to
determine the standard deviation, we need to find the variance first and then
take its square root.
\[ \sigma^2 = \sum_{i=1}^{N} [X_i - \mu]^2 \, P(X_i) \]
\[ \sigma = \sqrt{\text{variance}} \]
Example:
x P(x)
0 0.25
1 0.5
2 0.25
\[ \sigma = \sqrt{0.5875} = 0.766 \]
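The variance and standard deviation formulas translate directly into code. As a sketch, the example below applies them to the ‘Number of Modules Taken’ distribution from earlier in this topic; the variable names are our own.

```python
import math

# 'Number of Modules Taken' distribution from earlier in this topic
distribution = {2: 0.20, 3: 0.40, 4: 0.24, 5: 0.16}

# Step 1: the mean, mu = sum of x * P(x)
mu = sum(x * p for x, p in distribution.items())

# Step 2: the variance, sigma^2 = sum of (x - mu)^2 * P(x)
variance = sum((x - mu) ** 2 * p for x, p in distribution.items())

# Step 3: the standard deviation is the square root of the variance
sigma = math.sqrt(variance)

print(round(mu, 2), round(variance, 4), round(sigma, 3))
```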
x P(x)
0 0.40
1 0.30
2 0.10
3 0.15
4 0.05
• Find Mean.
To use a known probability distribution model, we first have to ensure that the
situation fits the conditions of the model. Thereafter, we can make use of the
known results (formulas), such as the probability distribution function, mean and
standard deviation, that describe the model. In this section, we will examine one
very important model - the Binomial Distribution (Weiss, 2017).
3) For each observation, the probability of getting the outcome that we are
interested in (success) should be the same (denoted by π).
Here are some common business applications that can be fitted into a Binomial
model. In general, whenever we want to determine the number of occurrences of
a certain characteristic (i.e. success) when there is a fixed number of
observations (i.e. n), we can immediately consider the possibility of fitting the
situation into a Binomial model.
Binomial Model
https://www.youtube.com/watch?v=59v5aZ8NMpk
Once we have determined that we can use the Binomial model, we can define X
by using the following prescribed format:
where
n: number of trials
P(x) is known as the probability function. In this case, since the formula
applies to the Binomial model, it is called the Binomial probability function
or, in brief, the Binomial Distribution. By substituting the desired value of x
into P(x), we can determine the probability of x successes occurring in n
observations.
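The standard Binomial probability function is P(x) = C(n, x) π^x (1 − π)^(n − x), where C(n, x) counts the ways of choosing which x of the n observations are successes. A minimal Python sketch follows; the function name `binomial_pmf` and the values n = 5, π = 0.5 are hypothetical choices for illustration.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n observations."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


# Hypothetical model: n = 5 observations, success probability 0.5
n, pi = 5, 0.5
for x in range(n + 1):
    print(x, binomial_pmf(x, n, pi))

# Final check: the probabilities of all possible values of x add up to 1
total = sum(binomial_pmf(x, n, pi) for x in range(n + 1))
print(total)
```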
a) Suppose we toss a dented coin 10 times and the chance that the coin
shows a Head in each throw is 10%. Is this a Binomial model? What are n and π?
n= π =
P(x) =
P(3) =
Binomial Distribution
LMS Learning Outcome 4.1
https://elearn-diploma.kaplan.com.sg
n= π= X=
P(x) =
Note that:
P0 + P1 + P2 + P3 + . . . . + Pn = 1 so
a) P(X = 2)
b) P(X ≤ 2)
c) P(X > 2)
d) Probability of at most 2
e) Probability of at least 2
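Because all the probabilities add up to 1, "more than" and "at least" questions are usually answered with a complement. The sketch below shows the pattern for a hypothetical Binomial model with n = 5 and π = 0.5; these values are not taken from the exercise above.

```python
from math import comb


def binomial_pmf(x, n, pi):
    """Probability of exactly x successes in n observations."""
    return comb(n, x) * pi**x * (1 - pi) ** (n - x)


def binomial_cdf(k, n, pi):
    """P(X <= k): add the probabilities one point at a time."""
    return sum(binomial_pmf(x, n, pi) for x in range(k + 1))


n, pi = 5, 0.5  # hypothetical values for illustration

p_exactly_2 = binomial_pmf(2, n, pi)          # P(X = 2)
p_at_most_2 = binomial_cdf(2, n, pi)          # P(X <= 2)
p_more_than_2 = 1 - binomial_cdf(2, n, pi)    # P(X > 2) = 1 - P(X <= 2)
p_at_least_2 = 1 - binomial_cdf(1, n, pi)     # P(X >= 2) = 1 - P(X <= 1)

print(p_exactly_2, p_at_most_2, p_more_than_2, p_at_least_2)
```

Note that "at most 2" includes X = 2, while "at least 2" is the complement of "at most 1", not of "at most 2".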
We have observed that the moment we could fit a situation into a Binomial model,
we can immediately use the Binomial probability function to determine the
desired probabilities. Besides this, we can also determine the Binomial Mean
and Binomial Standard Deviation (Weiss, 2017) by using the following formulas:
n= π=
Mean =
Variance =
Standard Deviation =
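For a Binomial model the standard results are Mean = nπ and Variance = nπ(1 − π), with the standard deviation being the square root of the variance. As a sketch, the code below applies them to the dented-coin figures (n = 10, π = 0.1) and cross-checks the shortcuts against the general discrete formulas.

```python
from math import comb, sqrt

n, pi = 10, 0.1  # dented-coin figures from the example earlier in this topic

# Binomial shortcut formulas
mean = n * pi
variance = n * pi * (1 - pi)
std_dev = sqrt(variance)

# Cross-check using the general formulas for any discrete distribution
pmf = [comb(n, x) * pi**x * (1 - pi) ** (n - x) for x in range(n + 1)]
mean_check = sum(x * p for x, p in enumerate(pmf))
variance_check = sum((x - mean_check) ** 2 * p for x, p in enumerate(pmf))

print(round(mean, 4), round(variance, 4), round(std_dev, 4))
print(round(mean_check, 4), round(variance_check, 4))
```

The shortcut formulas save us from building the whole probability distribution table first.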
Calculate the
A period of observations
e.g. in 1 day, 5 minutes, within 1 page
Note that the Binomial and Poisson models share some similarities. Both models
are for discrete variables and they are used to count the number of desired
observations (i.e. success).
Here are some common business applications that can be fitted into a Poisson
model:
Apart from counting the number of successes over a time interval, the Poisson
experiment also applies to a region of space, for example over a page or over a
length of road.
Once we have determined that we can use the Poisson model, we can define X
by using the following prescribed format:
\[ P(x) = \frac{e^{-\lambda} \lambda^{x}}{x!} \]
where
P(x) is known as the probability function. In this case, since the formula
applies to the Poisson model, it is called the Poisson probability function
or, in brief, the Poisson Distribution. By substituting the desired value of x
into P(x), we can determine the probability of x successes occurring within the
given period.
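The Poisson probability function can be coded in one line. In the sketch below, the function name `poisson_pmf` and the rate λ = 2 are hypothetical choices for illustration.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """Probability of exactly x successes when the average rate is lam."""
    return exp(-lam) * lam**x / factorial(x)


lam = 2.0  # hypothetical average number of successes per period

for x in range(6):
    print(x, round(poisson_pmf(x, lam), 4))

# Unlike the Binomial model, x has no upper limit, but the probabilities
# shrink quickly and still add up to 1.
total = sum(poisson_pmf(x, lam) for x in range(50))
print(round(total, 6))
```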
\[ P(x) = \frac{e^{-\lambda} \lambda^{x}}{x!} \]
P(x) =
P(3) =
If λ = 1.2, find
a) P(X = 4)
b) P(X < 2)
c) P(X ≥ 2)
d) Probability of at most 1
e) Probability of at least 2
We have observed that the moment we can fit a situation into a Poisson model,
we can use the Poisson probability function to determine the desired probabilities.
Besides this, we can also determine the Poisson Mean and Poisson Standard
Deviation (Sharma, 2014) as follows:
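For a Poisson model with rate λ, the standard results are Mean = λ and Standard Deviation = √λ. The sketch below checks these numerically against the general discrete formulas, using a hypothetical rate of λ = 4.

```python
from math import exp, factorial, sqrt

lam = 4.0  # hypothetical rate, chosen only for illustration


def poisson_pmf(x, lam):
    """Probability of exactly x successes when the average rate is lam."""
    return exp(-lam) * lam**x / factorial(x)


# Poisson shortcut formulas
mean = lam
std_dev = sqrt(lam)

# Cross-check with the general formulas; values of x beyond 60 have
# negligible probability for this rate, so the sum is cut off there.
xs = range(60)
mean_check = sum(x * poisson_pmf(x, lam) for x in xs)
variance_check = sum((x - mean_check) ** 2 * poisson_pmf(x, lam) for x in xs)

print(mean, std_dev)
print(round(mean_check, 6), round(sqrt(variance_check), 6))
```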
Poisson Distribution
LMS Learning Outcome 4.3
https://elearn-diploma.kaplan.com.sg
From past records, the owner of a fast food restaurant knows that, on average,
8.3 cars use the drive-through windows in 1 hour. Assuming that this event
follows a Poisson probability distribution, calculate the
c) mean and standard deviation for the number of cars that use the drive-
through windows per hour
Summary
Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.
REFERENCES
Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.
nzmaths. (n.d.). Expected value (of a discrete random variable). Retrieved from
https://nzmaths.co.nz/category/glossary/expected-value-discrete-random-variable
Sharma, JK. (2014). Business Statistics. India: Vikas Publishing House Pvt Ltd.
Learning Outcomes
The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:
You may recall that there are two types of variables – Qualitative (categorical)
and Quantitative (numerical). For quantitative variables, we can further divide
them into Discrete and Continuous variables. For this topic, we are dealing with
probabilities of continuous variables.
Recall that when we deal with probabilities for discrete variables, we would need
to construct probability distributions. In Binomial and Poisson models, we actually
have formulas for probability distributions. The symbol we used for probability
distribution is P(x). This is also known as a probability mass function.
For continuous variables, we will not have probability mass functions P(X).
Instead, we will be using Probability Density Function, denoted by f(x), to
represent the distribution (Mejlbro, 2009). A probability density function is a
function that describes the relative likelihood for this random variable to take on a
given value. Depending on the formula for f(x), we can obtain various curves
when we plot f(x) against x. Here are some examples:
If X is a random variable taking values between 0 and 1, what do you think is the
probability of obtaining a single value (say X = 0.3) in this interval?
The answer for the above question is actually zero. There are infinitely many
points within any interval. As such, the probability of obtaining one single point
(out of infinitely many points) is zero. Therefore, there is no meaning in asking for
probability of one point for a continuous interval. In continuous probability
distributions, we will always compute probability within an interval.
For example:
This is very different from the approach for discrete variables. In discrete
probability distributions, we always have to work out the probabilities one point at
a time. For example, in discrete probability distributions, if we want P(X ≤ 1), we
need to work out P(X = 0) and P(X = 1) individually before adding up the
probabilities.
You may recall that in discrete probability distributions, if we want to find the
probability of a point (say X = 3), we will substitute the value X = 3 into the
probability mass function to obtain the answer.
For continuous variables, the approach is different. To find the probability within
an interval, we need to find the area under the probability density function
within the interval. For example, to find P(a < X < b), we need to find the area
under the curve of the probability density function for the X value between a and
b. The desired area is indicated in the shaded area:
Since area represents probability in continuous distributions, the total area under
the probability density function (i.e. the curve) is 1.
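The link between area and probability can be illustrated numerically. The sketch below assumes a simple hypothetical density, f(x) = 2x for x between 0 and 1, and approximates the area with thin rectangles; neither the density nor the helper `area_under` comes from this guide.

```python
def pdf(x):
    """Hypothetical probability density: f(x) = 2x on the interval [0, 1]."""
    return 2 * x if 0 <= x <= 1 else 0.0


def area_under(f, a, b, steps=10_000):
    """Approximate the area under f between a and b (midpoint rule)."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


p_interval = area_under(pdf, 0.2, 0.5)   # P(0.2 < X < 0.5)
p_point = area_under(pdf, 0.3, 0.3)      # P(X = 0.3): an interval of zero width
total_area = area_under(pdf, 0.0, 1.0)   # whole curve

print(round(p_interval, 4), p_point, round(total_area, 4))
```

A single point has zero width and therefore zero area, which is why the probability of one exact value is zero for a continuous variable.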
We have discussed the concept of probability density function and the approach
to find probability for continuous variables. We shall now introduce a very
important continuous probability distribution known as Normal Distribution
(Berenson, 2015).
As you can see, the formula of the probability density function for the Normal
distribution is rather scary. Not to worry, as we have mentioned earlier, for
continuous variables, the formula is actually not important. What we need is
actually the curve of the probability density function.
The Empirical Rule (TutorVista.com, n.d.) states that for a normal distribution,
approximately:
68% of the data will fall within 1 standard deviation of the mean
95% of the data will fall within 2 standard deviations of the mean
Almost all (99.7%) of the data will fall within 3 standard deviations of the
mean
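The three percentages in the Empirical Rule can be confirmed with Python's standard library. This sketch uses `statistics.NormalDist` (available from Python 3.8), whose `cdf` method returns the area to the left of a value.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Area within k standard deviations of the mean, for k = 1, 2, 3
within = {}
for k in (1, 2, 3):
    within[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(k, round(within[k], 4))
```

The same percentages apply to any normal distribution, whatever its mean and standard deviation.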
X ~ N(µ, σ²)
We read the above notation as 'X follows a normal distribution with mean equal
to µ and variance equal to σ²'. We will be using this notation frequently in the
next topic.
Normal Distribution
https://www.youtube.com/watch?v=iYiOVISWXS4
To find the area under the bell curve directly is not easy. We usually make use of
a statistical table to do so. A Statistical Table is a data sheet that provides pre-
computed values of areas under the curves for a desired probability density
function. In the case of normal distribution, we will be using a statistical table for
the standard normal distribution.
Refer to the Standard Normal Table (also known as the Z table) at the end of this
book. There are two pages: the first page is for negative Z values while the
second page is for positive Z values.
The values indicated inside the tables refer to the left areas for the respective Z
values. For example, if we want to find P(Z < 2.00), we can look for Z = 2.00 in
the table and note the value provided (in this case it is 0.9772). The value 0.9772
is the left area from Z = 2.00 under the standard normal curve. This area is as
shown above.
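A Z table lookup can be cross-checked in Python. In this sketch, `NormalDist().cdf(z)` from the standard library plays the role of the table: it returns the left area for a given Z value.

```python
from statistics import NormalDist

z_table = NormalDist()  # standard normal: mean 0, standard deviation 1

left_area = z_table.cdf(2.00)   # P(Z < 2.00), the value read from the table
right_area = 1 - left_area      # P(Z > 2.00), since the total area is 1

print(round(left_area, 4), round(right_area, 4))
```

Because the table gives left areas only, a right area is always found by subtracting the table value from 1.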
It is important for you to be familiar with the Z table. We will be using this table
very frequently in subsequent topics. The following information may be useful for
the next exercise:
Find
Probability of Z Distribution
https://www.youtube.com/watch?v=zZWd56VlN7w
We have learned how to determine the probability for a Z distribution. You’ll recall
that Z distribution is a normal distribution with mean equal to 0 and standard
deviation equal to 1.
For any normal distribution with mean equal to µ and standard deviation equal to
σ, we use the following formula to scale it to a standard normal distribution.
Once we have the equivalent Z value, we can proceed exactly as we did
previously.
\[ Z = \frac{X - \mu}{\sigma} \]
Example:
Let X represent the time it takes to download an image file from the internet.
Suppose X is normally distributed with mean 8.0 and standard deviation 5.0. Find
P(X < 8.6).
Note: μ = 8, σ = 5
\[ P(X < 8.6) = P\left(Z < \frac{8.6 - 8}{5}\right) = P(Z < 0.12) = 0.5478 \]
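The download-time example can be reproduced in Python. The sketch below standardises X by hand and then uses `statistics.NormalDist` in place of the Z table; the variable names are our own.

```python
from statistics import NormalDist

mu, sigma = 8.0, 5.0   # mean and standard deviation from the example
x = 8.6

# Step 1: convert the X value to a Z value
z = (x - mu) / sigma

# Step 2: find the left area for that Z value
probability = NormalDist().cdf(z)

print(round(z, 2), round(probability, 4))

# The same answer without standardising by hand
direct = NormalDist(mu, sigma).cdf(x)
```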
Suppose X is normal with mean 8.0 and standard deviation 5.0. Find
a) P(X > 8.3)
An Intelligence Quotient (IQ) is a score derived from one of several tests
designed to assess intelligence. It is suspected that the IQ scores of Kaplan
Diploma students have a direct link to their exam performance in Statistics.
Kaplan has conducted a recent study and noted that the IQ scores of Kaplan
students are normally distributed with mean 102 and standard deviation 15.
Summary
Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.
REFERENCES
Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.
In this topic, we will discuss the concept of Sampling Distribution and examine
Sampling distributions of Sample Mean and Sample Proportion.
Learning Outcomes
The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:
Consider the population of all the individual heights of Kaplan students. Suppose
we repeatedly take samples of a given size (n) from this population and calculate
the average for each sample (i.e. the sample mean 𝑥̅ ). Each sample has its own
average value, and the probability distribution of these averages is called the
Sampling Distribution of the Sample Mean (Sharma, 2014).
We will discover later that the sampling distribution depends on the underlying
distribution of the population, the statistic being considered and the sample size
used.
Sampling Distribution
https://www.youtube.com/watch?v=Zbw-YvELsaM
We have introduced the concept of sampling distribution and used the example
of the sample mean to illustrate the idea. We shall now examine the Sampling
Distribution of Sample Mean in greater detail.
Theory 1:
\[ X \sim N(\mu, \sigma^2) \;\Rightarrow\; \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
Theory 1 states that if the underlying population is normally distributed then the
sampling distribution of sample means will also be normally distributed (Weiss,
2017). The mean and standard deviation of the sampling distribution will be µ
and σ/√n as per our previous discussion.
Theory 2:
\[ X \text{ not normal and } n \ge 30 \;\Rightarrow\; \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \text{ (approximately)} \]
Theory 2 states that if the underlying population is NOT normally distributed but
the sample size that we use is large (greater than or equal to 30), then the
sampling distribution of sample means will be approximately normally distributed
(Weisstein, 2018). The mean and standard deviation of the sampling distribution
will be µ and σ/√n as per our previous discussion.
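Theory 2 can be explored with a small simulation. The sketch below assumes a hypothetical, clearly non-normal population (an exponential distribution with µ = 1 and σ = 1) and draws many samples of size n = 30; the seed and the number of samples are arbitrary.

```python
import random
import statistics
from math import sqrt

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 30                   # sample size
num_samples = 5_000      # number of repeated samples

# Hypothetical skewed population: exponential with mean 1 and std dev 1
mu, sigma = 1.0, 1.0

sample_means = []
for _ in range(num_samples):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(round(mean_of_means, 3), "expected about", mu)
print(round(sd_of_means, 3), "expected about", round(sigma / sqrt(n), 3))
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the population itself is heavily skewed.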
In the last topic, we learnt that for any normal distribution, we need to scale it to
the standard normal distribution before we can determine the probabilities. You’ll
recall that the standard normal distribution (or Z distribution) is a normal
distribution with a mean equal to 0 and standard deviation equal to 1.
From Theories 1 and 2, when the conditions are met, the sampling distribution of
the sample mean will be normally distributed with mean equal to µ and standard
deviation equal to σ/√n. Therefore, we can convert the normally distributed
sampling distribution to the Z distribution using the following formula:
\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \]
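Example (Theory 1):
i.e. μ = 15, σ = 7, n = 16
\[ X \sim N(15, 7^2) \;\Rightarrow\; \bar{X} \sim N\left(15, \frac{7^2}{16}\right) \]
\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{18 - 15}{7 / \sqrt{16}} = 1.71 \]
\[ P(\bar{X} > 18) = P(Z > 1.71) = 1 - 0.9564 = 0.0436 \]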
The above example illustrates the case when the underlying population is
normally distributed. Here, we need to quote Theory 1 to support the fact that the
sampling distribution of the sample mean is normally distributed.
The underlying population has µ = 15 and σ = 7. We want to find P(X̄ > 18) when
the sample size n = 16. Our first task is to sketch the normal distribution curve
for X̄. Notice that when the axis is X̄, the centre is 15 (which is the value of µ).
We indicate the value X̄ = 18 and shade the required right area.
Our second task is to construct a Z axis directly below the X̄ axis. Recall
that the centre of the Z axis is zero. We now need to convert the value X̄ = 18
to a corresponding Z value. To do this, we use the formula we just
introduced. We note that the corresponding value is Z = 1.71. We indicate this
value directly below X̄ = 18. Therefore, finding P(X̄ > 18) is the same as finding
P(Z > 1.71). In other words, P(X̄ > 18) = P(Z > 1.71) = 1 − 0.9564 = 0.0436.
You may have realised that the approach we used here is very similar to the
previous topic. In fact, these are essentially the same except that the formula we
used for the conversion to the Z axis is different now.
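The Theory 1 example can be checked in Python. In this sketch the helper `z_for_sample_mean` is our own name, and `statistics.NormalDist` stands in for the Z table.

```python
from math import sqrt
from statistics import NormalDist


def z_for_sample_mean(x_bar, mu, sigma, n):
    """Z value for a sample mean: (x_bar - mu) / (sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / sqrt(n))


mu, sigma, n = 15, 7, 16   # figures from the Theory 1 example

z = z_for_sample_mean(18, mu, sigma, n)
p_right = 1 - NormalDist().cdf(z)   # right area, P(Z > z)

print(round(z, 2), round(p_right, 4))
```

The printed probability differs slightly in the last decimal places from the table answer of 0.0436, because the table rounds Z to 1.71 before the lookup while the code keeps the unrounded value.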
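i.e. μ = 8, σ = 3, n = 36
\[ X \text{ not normal but } n = 36 \ge 30 \;\Rightarrow\; \bar{X} \sim N\left(8, \frac{3^2}{36}\right) \]
\[ Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{7.8 - 8}{3 / \sqrt{36}} = -0.4 \qquad Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{8.3 - 8}{3 / \sqrt{36}} = 0.6 \]
\[ P(7.8 < \bar{X} < 8.3) = P(-0.4 < Z < 0.6) = 0.7257 - 0.3446 = 0.3811 \]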
The above example illustrates the case when the underlying population is not
normally distributed. Here, we need to quote Theory 2 to support the fact that the
sampling distribution of the sample mean is approximately normally distributed.
You would have realized that the approach we used here is almost the same as
the previous example. The only difference is that we are using Theory 2 instead
of Theory 1 to explain that X̄ is normally distributed.
The height of Kaplan students has mean 1.7 m and standard deviation 0.2 m. If a
sample of 40 students is selected, what is the probability that the mean height
is between 1.6 m and 1.75 m?
μ =      , σ =      , n =
We have discussed sampling distribution of the sample mean and introduced two
theories that determine when the distribution is normal. We shall now examine
another sample statistic – Sample Proportion.
\[ p = \frac{n(\text{characteristic})}{\text{sample size}} \]
Note: Proportion is between 0 and 1.
Example:
Suppose we know that 40% of Kaplan students like fast food. In this case, the
characteristic of interest is “like fast food” and the population proportion is 0.4. If
we take a sample of 200 Kaplan students and note that 90 of them like fast food
then the sample proportion is 0.45 (from 90/200).
nπ ≥ 10 and n(1 – π) ≥ 10
Once we know that the sample proportion, p, is normally distributed, we can use
the following formula to convert the sampling distribution of p to the Z
distribution, just as we did for X̄:
\[ Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} \]
Example:
If the proportion of Kaplan students who like fast food is 0.4, what is the
probability that, for a sample of 200 students being interviewed, the sample
proportion is between 0.40 and 0.45?
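nπ = 200(0.4) = 80 ≥ 10 and n(1 – π) = 200(0.6) = 120 ≥ 10
Therefore, p is normal.
\[ Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1 - \pi)}{n}}} = \frac{0.45 - 0.4}{\sqrt{\dfrac{0.4(1 - 0.4)}{200}}} = 1.44 \]
\[ P(0.40 < p < 0.45) = P(0 < Z < 1.44) = 0.9251 - 0.5 = 0.4251 \]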
In this example, we note that π = 0.4, n = 200 and we want to find P(0.4 < p <
0.45). Since we need to find the probability relating to the sample proportion p,
we need to confirm that the sampling distribution is normal. As shown above, we
verified that nπ and n(1 - π) are both greater than 10. Therefore, the sample
proportion (p) is normally distributed.
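The fast-food example can be reproduced in Python. This sketch first checks the two conditions and then converts both ends of the interval to Z values; the variable names are our own, and `statistics.NormalDist` stands in for the Z table.

```python
from math import sqrt
from statistics import NormalDist

pi, n = 0.4, 200   # population proportion and sample size from the example

# Conditions for the sample proportion to be treated as normal
conditions_met = n * pi >= 10 and n * (1 - pi) >= 10

# Standard deviation of the sample proportion
std_error = sqrt(pi * (1 - pi) / n)

z_lower = (0.40 - pi) / std_error
z_upper = (0.45 - pi) / std_error

probability = NormalDist().cdf(z_upper) - NormalDist().cdf(z_lower)

print(conditions_met, round(z_upper, 2), round(probability, 4))
```

The result differs slightly in the last decimal places from the table answer of 0.4251 because the table rounds Z to 1.44 before the lookup.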
The marketing department recently carried out an analysis to determine the source
of students for its Diploma courses. It is noted that 55% of the students are from
China. If a sample of 100 students is randomly selected, what is the probability
that more than 70% are from China?
Summary
Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.
REFERENCES
Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.
Rumsey, DJ. (n.d.). How to find the Sampling Distribution of a Sample Proportion.
Retrieved from
https://www.dummies.com/education/math/statistics/how-to-find-the-sampling-distribution-of-a-sample-proportion
Sharma, JK. (2014). Business Statistics. India: Vikas Publishing House Pvt Ltd.
Stine, R. & Foster, D. (2014). Statistics for Business. USA: Pearson Education
Ltd.
In the last few topics, we have been discussing the concept of probability. In
particular, we have examined the basic concepts of probability, discrete
probability distributions, continuous probability distributions and sampling
distributions.
With these as our foundation, we are now ready to work on some inferential
statistics. This topic will discuss the concept of Estimation. We will be finding the
Confidence Interval for Population Mean.
Learning Outcomes
The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:
Point Estimation involves the use of sample data to calculate a single value
(known as a statistic) which is to serve as a "best guess" or "best estimate" of an
unknown population parameter (Asadoorian & Kantarelis, 2008).
For example, to estimate the mean age of Kaplan students (parameter µ), we
could use the mean age of our class (say X̄ = 18.6) as the point estimate. In other
words, based on the sample, we guess that the population mean is 18.6.
Interval Estimate
https://www.youtube.com/watch?v=tFWsuO9f74o
The CIs are different from sample to sample but frequently include the parameter
of interest. How frequently the observed intervals contain the parameter is
determined by the Confidence Level (Stine & Foster, 2014). The common
confidence levels are 90%, 95% and 99%.
Example:
A 95% CI of the population mean (µ) is a range with lower and upper limits
calculated from a sample. This is used as an interval estimate of µ.
As the true µ is unknown, this range describes possible values that µ could take.
As this is an estimate, the actual µ may or may not be within this interval.
Suppose we took infinitely many samples (of the same size n) and calculated the
95% CI for each sample. We would expect 95% of these CIs to contain µ (NCBI,
2016).
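The repeated-sampling idea can be illustrated with a small simulation. The sketch below is ours, not part of the module: the population values (µ = 50, σ = 5), the sample size and the number of trials are hypothetical, and each interval is built with the σ-known formula that is introduced later in this topic.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

rng = random.Random(1)            # fixed seed so the run is repeatable
mu, sigma, n = 50.0, 5.0, 30      # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)   # critical value for 95% confidence
trials = 2000                     # stand-in for "infinitely many" samples

hits = 0
for _ in range(trials):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar = mean(sample)
    margin = z * sigma / sqrt(n)
    # Count the interval if it captures the true population mean.
    if x_bar - margin <= mu <= x_bar + margin:
        hits += 1

coverage = hits / trials
print(f"Proportion of intervals containing mu: {coverage:.3f}")
```

The printed proportion should be close to 0.95, although it will not be exactly 0.95 because only a finite number of samples is drawn.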
We shall now learn how to construct the Confidence Interval for Population
Mean. To find the CI for µ, we need to consider two cases: σ known and
σ unknown.
Recall that the symbol for population standard deviation is σ. Check that you are
familiar with the symbols used for all the population parameters and sample
statistics (see Topic 2). We will be referring to all these symbols frequently from
this topic onward.
To find the CI for µ when σ is Known, we can use the following formula
(Berenson, 2015):
\[ \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]
In order to use this formula, we require the sampling distribution of the
sample mean to be normal. Recall that we discussed two results in the
previous topic that could ensure that X̄ is normally distributed.
We are already familiar with the following symbols used in the formula:
X̄: Sample mean
σ: Population standard deviation
n: Sample size
Z: Standard normal variable
α: Significance level
Nevertheless, we have not discussed the meaning of the symbol Zα/2. The
symbol Zα/2 in the formula is known as the Critical Value. It is defined
as the Z value for which the area to its right under the standard normal curve is α/2. For
example, for a 95% CI, α is equal to 0.05, so α/2 = 0.025. The critical
value Z0.025 is the Z value for which the right area is 0.025.
Recall that the Z table only provides the left areas of the respective Z
values. As such, to find Z0.025, we first note that the left area is 0.975. We
then look in the Z table for the value whose left area is equal to 0.975. This
gives us a Z value of 1.96. Therefore, Z0.025 = 1.96. Alternatively, we can
look for the Z value when the left area is 0.025 (which is -1.96) and then reflect it to
the positive side to obtain Z0.025.
We may not always be able to find the exact value of the desired left area in the Z
table. In such situations, we shall use the closest value as an approximation.
Determine Z0.05. You will discover that the two table entries nearest to the required
left area of 0.95 (0.9495 and 0.9505) are equally close to it. Use the midpoint of the
two Z values (1.645) in such cases.
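The table look-up can also be reproduced in code. The sketch below uses `NormalDist.inv_cdf` from Python's standard library, which returns the Z value for a given left area; the helper name `z_critical` is our own.

```python
from statistics import NormalDist

def z_critical(right_area):
    """Z value with the given area to its right under the standard normal curve."""
    left_area = 1 - right_area          # the Z table (and inv_cdf) work with left areas
    return NormalDist().inv_cdf(left_area)

# Right areas used for 90%, 95% and 99% confidence intervals.
for right_area in (0.05, 0.025, 0.005):
    print(right_area, round(z_critical(right_area), 3))
```

For the three right areas above, this gives approximately 1.645, 1.960 and 2.576, matching the Z table values.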
Our first task is to identify all the numerical values and assign appropriate
symbols to them. Since σ is known, we can use the above formula with Z as the
critical value. Note that before we use the formula, we need to confirm that the
sampling distribution of X̄ is normal.
Example:
Determine a 95% confidence interval for the mean resistance of the population.
Next, we verify that the conditions are met before we could use the formula:
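σ Known
\[ \bar{X} \pm Z_{\alpha/2} \frac{\sigma}{\sqrt{n}} = 2.20 \pm 1.96 \left( \frac{0.35}{\sqrt{11}} \right) = 2.20 \pm 0.207 \]
\[ 1.993 \text{ ohms} \le \mu \le 2.407 \text{ ohms} \]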
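As a cross-check of the arithmetic, the interval can be computed with a small helper. This is a sketch only; the function name `z_interval` is hypothetical, and the inputs (sample mean 2.20, σ = 0.35, n = 11) are the values substituted into the formula above.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, confidence):
    """Confidence interval for the population mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value Z_{alpha/2}
    margin = z * sigma / sqrt(n)              # margin of error
    return x_bar - margin, x_bar + margin

lower, upper = z_interval(x_bar=2.20, sigma=0.35, n=11, confidence=0.95)
print(round(lower, 3), round(upper, 3))
```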
CI for µ (σ known)
LMS Learning Outcome 7.1
https://elearn-diploma.kaplan.com.sg
CI for µ (σ known)
https://www.youtube.com/watch?v=czdwHU27OqA
We have examined the case when the population standard deviation is known.
We shall now discuss how we could construct the Confidence Interval when
Population Standard Deviation is Unknown.
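To construct the CI for µ when σ is unknown, we can use the following formula (Lind,
Marchal & Wathen, 2018):
\[ \bar{X} \pm t_{\alpha/2} \frac{S}{\sqrt{n}} \]
Here S is the sample standard deviation, and the t distribution has n - 1 degrees of
freedom.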
As the degrees of freedom increase, the t-distribution curve becomes narrower and
more like the Z curve. Hence, the t values will be approximately equal to the Z
value when the degrees of freedom are large (Magnusson, n.d.).
The following example illustrates the Meaning of the Critical Value tα/2. Similar
to the critical value Zα/2, the critical value tα/2 represents the t value for which the
area to its right under the t curve is α/2.
Example:
Find the critical value for a 90% confidence interval when the sample size is 10.
n = 10, so df = n - 1 = 10 - 1 = 9
For a 90% CI, α = 0.10, so α/2 = 0.05. From the t table, t0.05 with df = 9 is 1.833.
Did you notice that the t values decrease as the degrees of freedom
increase? What do you think would be the value of t0.025 when the degrees of
freedom are 1000?
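In this module the critical values are read from the t table. For readers who want to see where those numbers come from, here is a rough numerical sketch using only Python's standard library. It integrates the t density with Simpson's rule and searches for the critical value by bisection; the function names are our own, and the search assumes the critical value lies below 100.

```python
from math import exp, lgamma, log, pi

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))

def t_cdf(x, df, steps=1000):
    """Left area at x: Simpson's rule on [0, |x|], then symmetry about 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_upper(right_area, df):
    """t value with the given area to its right (bisection on [0, 100])."""
    lo, hi = 0.0, 100.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if 1 - t_cdf(mid, df) > right_area:
            lo = mid      # right tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_upper(0.05, 9), 3))             # 90% CI with n = 10
for df in (5, 24, 100, 1000):
    print(df, round(t_upper(0.025, df), 3))   # values shrink towards Z = 1.96
```

The first value reproduces the table entry 1.833 for df = 9, and the loop shows t0.025 falling towards 1.96 as the degrees of freedom grow.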
Now that we are able to find the critical value for t distribution, we can proceed to
determine the Confidence Interval for Population Mean when the population
standard deviation is unknown.
Example:
Our first task is to identify all the numerical values and assign appropriate
symbols to them:
We need to verify that the conditions are met before we could use the formula:
• σ Unknown (S = 5.4)
• Population normal
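\[ \bar{X} \pm t_{\alpha/2} \frac{S}{\sqrt{n}} = 62.3 \pm 2.064 \left( \frac{5.4}{\sqrt{25}} \right) = 62.3 \pm 2.23 \]
\[ 60.07 \text{ kg} \le \mu \le 64.53 \text{ kg} \]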
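The same arithmetic can be sketched in code. Here the critical value 2.064 is taken from the t table (df = 24), as in the working above, and the function name `t_interval` is hypothetical.

```python
from math import sqrt

def t_interval(x_bar, s, n, t_crit):
    """Confidence interval for the population mean when sigma is unknown.

    t_crit is the critical value read from the t table with n - 1
    degrees of freedom.
    """
    margin = t_crit * s / sqrt(n)
    return x_bar - margin, x_bar + margin

lower, upper = t_interval(x_bar=62.3, s=5.4, n=25, t_crit=2.064)
print(round(lower, 2), round(upper, 2))
```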
CI for µ (σ Unknown)
LMS Learning Outcome 7.2
https://elearn-diploma.kaplan.com.sg
A random sample of 20 bags of sugar was drawn to check their weight. The
measurements showed a mean of 1.02 kg and a standard deviation of
0.1 kg. It is assumed that the weight is normally distributed. Form a 99%
confidence interval for µ.
CI for µ (σ Unknown)
https://www.youtube.com/watch?annotation_id=annotation_100657&feature=iv&src_vid=_NGYJxrUGgQ&v=bFefxSE5bmo
Summary
Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.
REFERENCES
Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.
Lind, DA., Marchal WG., & Wathen, SA. (2018). Statistical Techniques in
Business and Economics. USA: McGraw-Hill Education.
Stine, R. & Foster, D. (2014). Statistics for Business. USA: Pearson Education
Ltd.
Learning Outcomes
The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:
2. Perform Z-Test
3. Perform t-Test
Example:
We claim that the mean monthly handphone bill of Kaplan students is $50
(i.e. µ = 50).
Since we do not have the data of all Kaplan students’ handphone bills, we are
not able to determine the real value of the population mean.
Hence, the claim is a hypothesis since we are not able to verify whether it is true
or false.
The alternative hypothesis is like our standby hypothesis. In the event that we
reject the null hypothesis, we will accept the alternative hypothesis.
Note that the hypotheses will be on the population parameters (in our case, it is
on µ). It will never be on the sample statistics.
To Set Up the Hypotheses, we start off with the business claim. If the business
claim contains the ‘=’ sign (i.e. =, ≤ or ≥), we place it in the null hypothesis.
Otherwise, it is placed in the alternative hypothesis. The alternative
hypothesis is always the complement of the null hypothesis. That is, when
we combine the null and alternative hypotheses, we get the set of all real
numbers (Berenson, 2015).
Using the above guide, we can set up one of the following three hypotheses.
Note that the value ‘50’ is just an example. The actual value may be another
number depending on the business claim.
H0: µ = 50    H1: µ ≠ 50    (two-tailed)
H0: µ ≥ 50    H1: µ < 50    (left-tailed)
H0: µ ≤ 50    H1: µ > 50    (right-tailed)
Observe that the null hypothesis always has the ‘=’ sign (i.e. =, ≤ or ≥),
and the alternative hypothesis never has the ‘=’ sign (i.e. ≠, <
or >).
We will discuss the meaning of these tails later. At the moment, it suffices to be
able to determine the appropriate tail of the test.
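The rule for placing the claim can be written as a small function. This is an illustrative sketch only: the function name and its return format are our own, and the tail is read from the sign in H1 as described above.

```python
def set_up_hypotheses(claim_sign, value):
    """Return (H0, H1, tail, where the claim sits) for a claim 'µ <sign> value'."""
    # Each sign and its complement; signs with '=' belong in H0.
    complement = {"=": "≠", "≤": ">", "≥": "<", "≠": "=", ">": "≤", "<": "≥"}
    # The tail of the test follows the sign in H1.
    tail_of_h1 = {"≠": "two-tailed", ">": "right-tailed", "<": "left-tailed"}

    if claim_sign not in complement:
        raise ValueError(f"unknown sign: {claim_sign}")

    if claim_sign in ("=", "≤", "≥"):      # claim contains '=': it goes in H0
        h0_sign, h1_sign, claim_in = claim_sign, complement[claim_sign], "H0"
    else:                                   # otherwise the claim goes in H1
        h0_sign, h1_sign, claim_in = complement[claim_sign], claim_sign, "H1"

    return (f"H0: µ {h0_sign} {value}", f"H1: µ {h1_sign} {value}",
            tail_of_h1[h1_sign], claim_in)

print(set_up_hypotheses("=", 50))   # claim in H0, two-tailed
print(set_up_hypotheses(">", 50))   # claim in H1, right-tailed
print(set_up_hypotheses("<", 50))   # claim in H1, left-tailed
```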
Setting Up Hypothesis
https://www.youtube.com/watch?v=R2hxisYFKxM
The following are two examples that illustrate how we could set up the null and
alternative hypotheses.
Example 1:
(i.e. µ = 64.3)
H0: µ = 64.3 (claim)
H1: µ ≠ 64.3
(two-tailed)
In Example 1, the claim is ‘µ = 64.3’. Since this claim contains the ‘=’ sign, we
placed it in H0. Therefore, H1 will be the opposite which is ‘µ ≠ 64.3’. Since the
sign in H1 is ‘≠’, this is a two-tailed test.
Example 2:
Claim: The average Statistics exam score of Kaplan students is more than 64.3.
(i.e. µ > 64.3)
H0: µ ≤ 64.3
H1: µ > 64.3 (claim)
(right-tailed)
In Example 2, the claim is ‘µ > 64.3’. Since this claim does not contain the ‘=’ sign,
we have placed it in H1. Therefore, H0 will be the opposite which is ‘µ ≤ 64.3’.
Since the sign in H1 is ‘>’, this is a right-tailed test.
Claim: _______________________________________________
(i.e. )
H0:
H1:
( -tailed)
Claim: _______________________________________________
(i.e. )
3. Last year, the average GPA for Kaplan graduates was 3.2. It is believed that the
GPA for this year's graduates has significantly changed.
Once we have set up the hypotheses, we wish to determine whether the null
hypothesis is true or false. To do so, we will collect a sample from the specified
population.
For example, if we claim that the mean monthly handphone bill of Kaplan
students is $50, we can randomly sample 100 students and calculate the sample
mean. If X̄ = 20 then it is difficult to believe that µ = 50. If X̄ = 40 then it is more
believable that µ = 50. If X̄ = 48 then it is very likely that µ = 50.
So, as the value of X̄ gets nearer to the proposed value of µ, we are more
likely to believe that the null hypothesis is true. At exactly what point between 20
and 50 do we begin to believe that the population mean could be 50?
This cut-off point is known as the Critical Value (Asadoorian & Kantarelis, 2008).
The following picture illustrates the idea of critical values:
If we are performing a left-tailed test, the small left area is shaded and the
critical value is on the left. Conversely, for a right-tailed test, the shaded area is on
the right and the critical value is on the right. For a two-tailed test, there are two
shaded areas and two critical values as shown above.
The shaded portion is known as the Rejection Region. We will discuss this in
more detail after we have introduced the concept of type I & II errors.
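                      Actual Situation
Decision              H0 is true               H0 is false
Do not reject H0      Correct decision         Type II error
                      (Probability 1 - α)      (Probability β)
Reject H0             Type I error             Correct decision
                      (Probability α)          (Probability 1 - β)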
In hypothesis testing, we usually avoid a Type II error by concluding ‘do not reject H0’
instead of ‘accept H0’. However, we cannot avoid a Type I error since our starting
point is to suspect that H0 may not be true (i.e. to reject H0). Therefore, in
hypothesis testing, there is always a risk of a Type I error, and we call α the
Level of Significance of the test (Frost, 2017).
The total area of the shaded tail(s) (left, right or two tails) is α. So, in the
case of a two-tailed test, the area of each tail is α/2. We can then find the respective
critical values. For example, if the curve is a Z curve, then the critical values for a
two-tailed test are ±Zα/2.
We will learn how to compute a value known as the test statistic later. If this
value falls inside the shaded region, we reject H0. Otherwise, we do not reject H0.
In summary, the only two possible conclusions for hypothesis
testing are ‘reject H0’ and ‘do not reject H0’.
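The rejection rule for a Z curve can be sketched as follows. The function name `reject_h0` and the tail labels are our own choices; the critical values come from the standard normal distribution as described above.

```python
from statistics import NormalDist

def reject_h0(z_stat, alpha, tail):
    """Return True if the test statistic falls in the rejection region."""
    std_normal = NormalDist()
    if tail == "two-tailed":
        # Each tail has area alpha/2; critical values are ±Z_{alpha/2}.
        return abs(z_stat) > std_normal.inv_cdf(1 - alpha / 2)
    if tail == "right-tailed":
        return z_stat > std_normal.inv_cdf(1 - alpha)
    if tail == "left-tailed":
        return z_stat < std_normal.inv_cdf(alpha)
    raise ValueError(f"unknown tail: {tail}")

for z_stat, tail in [(2.5, "two-tailed"), (1.5, "two-tailed"), (-1.7, "left-tailed")]:
    decision = "reject H0" if reject_h0(z_stat, 0.05, tail) else "do not reject H0"
    print(z_stat, tail, decision)
```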
We will be introducing two hypothesis tests (Z test and t test) in this topic. These
tests are for the population mean.
You may observe that this is similar to the case when we construct the
confidence interval for population mean. We will discuss each test in detail later.
Now that we have introduced all the necessary concepts, we can summarize it by
stating the five steps involved in any hypothesis testing. Note down these five
steps and keep this in mind when we discuss a few examples of hypothesis tests.
We call this the Critical Value Approach since
we make use of the critical value(s) to determine the conclusion of the test.
• Step 1: Hypothesis
• Step 2: α
• Step 3: Test statistic
• Step 4: Critical value(s) and rejection region
• Step 5: Conclusion
We shall now introduce the Hypothesis Testing for Population Mean when
the Population Standard Deviation is Known (Weiss, 2017). The test statistic
for this case is as follows:
\[ Z_{STAT} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]
To perform the Z test, we require the sampling distribution of the sample mean to
be normally distributed. This can be achieved if either the underlying population is
normally distributed or the sample size is large. Note that this is the same
condition as in the case of finding the confidence interval for µ when σ is known.
Example:
Test the claim, at the 0.05 significance level, that the mean monthly handphone bill of
Kaplan students is $50. From a sample of 100 students, we obtained a mean of
$52.3. You may assume that σ = $4.8.
We will use the five steps mentioned earlier to perform the hypothesis testing.
Step 1:
Claim: the mean monthly handphone bills of Kaplan students is $50 (i.e. µ = 50)
H0: μ = 50 (claim)
H1: μ ≠ 50
(two-tailed)
Step 2:
This step is the easiest. We just need to decide on the level of significance.
α = 0.05.
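Step 3:
σ is known
\[ Z_{STAT} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{52.3 - 50}{4.8/\sqrt{100}} = \frac{2.3}{0.48} \approx 4.79 \]
Step 4:
For a two-tailed test with α = 0.05, the critical values are ±Z0.025 = ±1.96. Since
4.79 is greater than 1.96, the test statistic falls in the rejection region.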
Step 5:
This is the final step in which we draw a conclusion. Since the value of the test
statistic is in the rejection region, we reject H0 and accept H1. We now
need to explain the test conclusion from the perspective of our business case.
Recall that our claim is in H0. Therefore, rejecting H0 implies that we reject our
claim. Hence, we conclude the test with a final statement: at the 0.05 significance
level, there is sufficient evidence to reject the claim that the mean monthly
handphone bill of Kaplan students is $50.
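The five steps of this Z test can be traced in a few lines of code. This is a minimal sketch using the figures from the example; the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

# Step 1: H0: µ = 50 (claim), H1: µ ≠ 50, two-tailed.
x_bar, mu_0, sigma, n = 52.3, 50.0, 4.8, 100

# Step 2: level of significance.
alpha = 0.05

# Step 3: test statistic (σ known, so we use Z).
z_stat = (x_bar - mu_0) / (sigma / sqrt(n))

# Step 4: critical values ±Z_{alpha/2} and rejection rule.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(z_stat) > z_crit

# Step 5: conclusion.
print(round(z_stat, 2), round(z_crit, 2))
print("Reject H0" if reject else "Do not reject H0")
```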
Z - test
LMS Learning Outcome 8.1
https://elearn-diploma.kaplan.com.sg
Test the claim, at the 0.01 significance level, that the population mean weight of a bar of
chocolate is 250 g. You may assume that the weight is normally distributed with
a population standard deviation of 30 g.
A sample of 20 bars of chocolate has shown that the average weight is 240 g.
We shall now introduce the Hypothesis Testing for Population Mean when
the Population Standard Deviation is Unknown (Stine, 2014). The test statistic
for this case is as follows, with n - 1 degrees of freedom:
\[ t_{STAT} = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]
Example:
Step 1:
H0: µ ≤ 64.3
H1: µ > 64.3 (claim)
(right-tailed)
Step 2: α = 0.10
Step 3:
In this step, we have to decide on the appropriate test statistic. Since we are testing
µ, we need to check whether the population standard deviation (σ) is known. In
this case, since it is unknown, we would like to use the t test. However, before we
can proceed further, we need to verify that the underlying population is normally
distributed. This is indeed true since the question has indicated that the scores
are normally distributed.
σ is Unknown (S = 5.2)
X is normal
\[ t_{STAT} = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{64.8 - 64.3}{5.2/\sqrt{25}} = \frac{0.5}{1.04} \approx 0.48 \]
Step 4:
In this step, we present the rejection rule using the t curve. Since we are
doing a right-tailed test, the area of the right tail is 0.10. Looking up the t table,
we can determine that the critical value is 1.318. Remember that for the t test,
the degrees of freedom are n – 1 (i.e. 25 – 1 = 24). Next, we mark the computed
test statistic on the graph. Since the value 0.48 (about 0.5) is less than 1.318, it lies
outside the rejection region.
Step 5:
Since the value of the test statistic is not in the rejection region, we do not reject
H0 and cannot accept H1. Remember that we need to explain the test
conclusion from the perspective of our business case.
Since the claim is in H1 and we cannot accept H1, it implies that we cannot
accept the claim. Hence, we conclude the test with a final statement: at the 0.10
significance level, there is insufficient evidence to support the claim that µ is more
than 64.3.
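The t test above can be traced in the same way. In this sketch the critical value 1.318 is taken from the t table (right area 0.10, df = 24), as in Step 4; the variable names are our own.

```python
from math import sqrt

# Step 1: H0: µ ≤ 64.3, H1: µ > 64.3 (claim), right-tailed.
x_bar, mu_0, s, n = 64.8, 64.3, 5.2, 25

# Step 2: α = 0.10.
# Step 3: test statistic (σ unknown, population normal, so we use t).
t_stat = (x_bar - mu_0) / (s / sqrt(n))

# Step 4: critical value from the t table, df = n - 1 = 24, right area 0.10.
t_crit = 1.318
reject = t_stat > t_crit

# Step 5: conclusion.
print(round(t_stat, 2))
print("Reject H0" if reject else "Do not reject H0")
```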
t - test
LMS Learning Outcome 8.2
https://elearn-diploma.kaplan.com.sg
SMRT claims that the mean waiting time for a train during the morning peak
hour is less than 5 minutes.
A sample of 20 students was interviewed and their mean waiting time was 5.2
minutes with a sample standard deviation of 0.8 minutes.
You may assume that the waiting time for a train is normally distributed.
Summary
Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.
REFERENCES
Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.
Lind, DA., Marchal WG., & Wathen, SA. (2018). Statistical Techniques in
Business and Economics. USA: McGraw-Hill Education.
Stine, R. & Foster, D. (2014). Statistics for Business. USA: Pearson Education
Ltd.
Learning Outcomes
The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:
In order to compare the means of two populations, we will examine the difference
between the means (Berenson, 2015). Suppose that we have a population with
mean µ1 and another population with mean µ2. We will examine the difference µ1
- µ2. In that case, we could set up one of the following three hypotheses:
H0: µ1 - µ2 = 0    H1: µ1 - µ2 ≠ 0    (two-tailed)
H0: µ1 - µ2 ≥ 0    H1: µ1 - µ2 < 0    (left-tailed)
H0: µ1 - µ2 ≤ 0    H1: µ1 - µ2 > 0    (right-tailed)
Example:
Xtrem Slimming Pte Ltd has introduced a new slimming pill which it claims is
effective in reducing body weight. Ten participants tried the new pill for one
month and their weights were measured. Do you think the new slimming product
is effective?
(left-tailed)
Claim: _______________________________________________
2. Some people suspect that the mean IQ score of Kaplan students is different from
that of SIM students.
Claim:
There are four hypothesis tests for the difference between two population means:
1. Paired t-Test
2. Independent Z-Test
3. Pooled t-Test
4. Nonpooled t-Test
To determine which test to use, we will consider the following:
1. Are the data related? Can the data be paired? If yes, we will use the Paired t-Test
2. If the data are independent, then we need to check if the population
standard deviations are known:
a. If the population standard deviations are known, we use
Independent Z-Test
b. If the population standard deviations are unknown, we need to
estimate whether these are equal (we will discuss how later):
i. If the population standard deviations are unknown but equal,
we use Pooled t-Test
ii. If the population standard deviations are unknown but not
equal, we use Nonpooled t-Test
We will discuss each of these tests in the following pages. Do a quick recall of
the five steps in hypothesis testing as we will be using it in all these tests.
The first test we are discussing is the Paired t-Test (Asadoorian & Kantarelis,
2008). This is used when the data could be paired. That is, for the experiments
that we do, we could pair the measurements for each experiment by a common
attribute.
Example:
Let's take a look at another example. Consider the situation when we want to
determine whether a brand of slimming pills is effective on adults. The first
population consists of all adults who have not taken the pills. The second
population consists of all adults who have taken the pills. The
best way to examine the effectiveness of the pills is to observe how much weight
a person has lost after taking the pills. To this end, we can use a common
attribute (i.e. the same person) when taking measurements. The weight of a
person before taking the pills represents the sample data from the first
population. The weight of the same person after taking the pills represents the sample
data from the second population.
Example:
A random sample of 10 married couples gave the following data on ages, in years. At the
10% significance level, do the data provide sufficient evidence to conclude that the
mean age of married men is more than the mean age of married women?
Couple   Husband   Wife   Difference (d)
1        59        53      6
2        21        22     -1
3        33        36     -3
4        78        74      4
5        70        64      6
6        33        35     -2
7        68        67      1
8        32        28      4
9        54        41     13
10       52        44      8
Total                     36
Step 1:
As usual, we start off by stating the claim and then formulate it into mathematical
symbols. We would need to adjust the symbols so that the parameters are on the
left side as shown below. The principles used in setting the hypotheses are
exactly the same as those used in the previous topic.
Claim: The mean age of married men is more than the mean age of married
women
(i.e. µM > µW so µM - µW > 0)
H0: µM - µW ≤ 0
H1: µM - µW > 0 (claim)
(right-tailed)
Step 2: α = 0.10
Step 3:
We have to check that the data are paired and that the population of differences is
normally distributed. We can draw some inference on the normality of the
differences by looking at the histogram. Once we are satisfied with the
assumption, we can compute the test statistic.
\[ t_{STAT} = \frac{\bar{d}}{S_d/\sqrt{n}} = \frac{3.6}{4.97/\sqrt{10}} = 2.29 \]
Step 4:
Next, we will draw the t curve to determine the rejection region. Remember to
indicate the computed value of the test statistic on the graph.
Step 5:
The final step is the conclusion. Remember to comment on your test result from
the business perspective.
Since the test statistic 2.29 is greater than the critical value t0.10 = 1.383 (df = 9),
it falls in the rejection region and we reject H0.
Therefore, we accept that the mean age of married men is more than the mean
age of married women.
You might have noticed by now that the above procedure is similar to the t test
we did in the last topic. In fact, the paired t-test is exactly the same t-test
we did in the previous topic, except that we have to pair the data and
convert the two-sample data into one-sample data by taking differences.
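The conversion from two samples to one sample of differences is easy to see in code. The sketch below uses the ten pairs of ages from the example; labelling the first column as husbands and the second as wives is our assumption (it matches the positive mean difference), and the critical value 1.383 is the t table value for a right area of 0.10 with df = 9.

```python
from math import sqrt
from statistics import mean, stdev

# Ages of the 10 couples (column labels assumed: first = husband, second = wife).
husband = [59, 21, 33, 78, 70, 33, 68, 32, 54, 52]
wife = [53, 22, 36, 74, 64, 35, 67, 28, 41, 44]

# Pair the data and reduce it to one sample of differences.
diffs = [h - w for h, w in zip(husband, wife)]
n = len(diffs)

d_bar = mean(diffs)    # mean of the differences
s_d = stdev(diffs)     # sample standard deviation of the differences

# One-sample t statistic on the differences (H0: µM - µW ≤ 0).
t_stat = d_bar / (s_d / sqrt(n))

t_crit = 1.383         # t table: right area 0.10, df = n - 1 = 9
reject = t_stat > t_crit

print(sum(diffs), round(d_bar, 1), round(s_d, 2), round(t_stat, 2))
print("Reject H0" if reject else "Do not reject H0")
```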
Paired t - test
https://www.youtube.com/watch?v=vB1OmEY5Rcw
Xtrem Slimming Pte Ltd has introduced a new slimming pill which it claims is
effective in reducing body weight. Ten participants tried the new pill for one
month and their weights were measured. At the 5% significance level, do you think the
new slimming product is effective?
The second test we are discussing is the Independent Z-Test (Weiss, 2017).
This is used when the two samples are independent and the population standard
deviations are known. When the hypothesised difference µ1 - µ2 is 0, the test
statistic is:
\[ Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} \]
For example, consider the case where we would like to compare the IQ scores of
Kaplan students and SIM students. We could draw a sample of Kaplan students
and a sample of SIM students. There is no way for us to find a meaningful
common attribute to pair the samples. They are obviously independent.
If we also know the population standard deviations of the IQ scores for Kaplan and
SIM students, we could use the test statistic shown above. Note that we also
require both populations to be normally distributed. The Independent Z-Test is
not commonly used since in most situations we do not know the values of the
population standard deviations. Nevertheless, it is good to run through an
example of this test as it provides a good foundation for discussion of the
remaining two tests.
Example:
Test, at the 0.05 significance level, the claim that the mean IQ score of Kaplan
students is different from that of SIM students. A sample of 20 students from Kaplan
shows an average IQ score of 105.8 while a sample of 28 students from SIM has
an average of 106.3. Assume that the IQ scores for Kaplan and SIM students
are normally distributed with standard deviations 5.3 and 5.8 respectively.
Step 1:
Claim: The mean IQ scores of Kaplan students is different from SIM students
(i.e. µK ≠ µS so µK - µS ≠ 0)
H0: µK - µS = 0
H1: µK - µS ≠ 0 (claim)
(two-tailed)
Step 2: α = 0.05
Step 3:
Independent samples
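\[ Z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{105.8 - 106.3}{\sqrt{\frac{5.3^2}{20} + \frac{5.8^2}{28}}} = -0.31 \]
Step 4:
For a two-tailed test with α = 0.05, the critical values are ±Z0.025 = ±1.96. Since
-0.31 lies between -1.96 and 1.96, the test statistic is outside the rejection region.
Step 5:
We do not reject H0. There is insufficient evidence to support the claim that the mean
IQ score of Kaplan students is different from that of SIM students.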
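The Independent Z-Test calculation can be checked with the following sketch, which uses the figures from this example. The variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

# Kaplan sample (1) and SIM sample (2); population standard deviations are known.
x1, sigma1, n1 = 105.8, 5.3, 20
x2, sigma2, n2 = 106.3, 5.8, 28
alpha = 0.05

# Test statistic for H0: µK - µS = 0.
z_stat = (x1 - x2) / sqrt(sigma1**2 / n1 + sigma2**2 / n2)

# Two-tailed test: critical values are ±Z_{alpha/2}.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject = abs(z_stat) > z_crit

print(round(z_stat, 2), round(z_crit, 2))
print("Reject H0" if reject else "Do not reject H0")
```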
Independent Z - test
https://www.youtube.com/watch?v=EwyxQ-yLSbU
We have discussed the Independent Z-Test. This test is for the case when the
population standard deviations are known. In the event that the population
standard deviations are unknown, we need to use a t-Test. For the t-Test,
we need to decide whether to pool the variances. The following are
the guidelines, where SBig and SSmall are the larger and smaller of the two sample
standard deviations:
If SBig/SSmall < 2, treat the population standard deviations as unknown but equal
and use a Pooled t-Test
Otherwise, treat the population standard deviations as unknown but not equal and
use a Nonpooled t-Test
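The choice among the four tests can be summarised as a small decision function. This is a sketch only: the function name and arguments are hypothetical, and it applies the SBig/SSmall < 2 guideline to decide between the pooled and nonpooled tests.

```python
def choose_test(paired, sigmas_known=False, s1=None, s2=None):
    """Pick the test for the difference between two population means."""
    if paired:
        return "Paired t-Test"
    if sigmas_known:
        return "Independent Z-Test"
    if s1 is None or s2 is None:
        raise ValueError("sample standard deviations are needed when sigmas are unknown")
    # Rule of thumb: ratio of larger to smaller sample standard deviation.
    ratio = max(s1, s2) / min(s1, s2)
    return "Pooled t-Test" if ratio < 2 else "Nonpooled t-Test"

print(choose_test(paired=True))
print(choose_test(paired=False, sigmas_known=True))
print(choose_test(paired=False, s1=5.3, s2=5.8))   # ratio about 1.09
print(choose_test(paired=False, s1=7.3, s2=2.8))   # ratio about 2.61
```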
The third test we are discussing is the Pooled t-Test (Lind, Marchal & Wathen,
2018). This is used when the two samples are independent and the population
standard deviations are unknown but equal. We will also require both populations
to be normally distributed.
Since both population standard deviations are equal, we combine the sample
variances and determine the pooled sample standard deviation (denoted by Sp).
We then substitute this value into the test statistic. When the hypothesised
difference µ1 - µ2 is 0, the formulas are:
\[ S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}} \]
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]
Note that the degrees of freedom for the t distribution are
n1 + n2 – 2.
Example:
Test, at the 0.05 significance level, the claim that the mean IQ score of Kaplan
students is different from that of SIM students. A sample of 10 students from Kaplan
shows an average IQ score of 105.8 with a standard deviation of 5.3. Another
sample of 18 students was selected from SIM and the average IQ score was
106.3 with a standard deviation of 5.8. Assume that the IQ scores for Kaplan
and SIM students are normally distributed.
Step 1:
Claim: The mean IQ scores of Kaplan students is different from SIM students
(i.e. µK ≠ µS so µK - µS ≠ 0)
H0: µK - µS = 0
H1: µK - µS ≠ 0 (claim)
(two-tailed)
Step 2: α = 0.05
Step 3:
Independent samples
Step 4:
Step 5:
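Steps 3 to 5 of this pooled example can be sketched in code as follows. The pooled standard deviation and test statistic use the standard pooled-variance formulas with df = n1 + n2 - 2; the critical value 2.056 is the t table value for a right area of 0.025 with df = 26. The variable names are our own.

```python
from math import sqrt

# Kaplan sample (1) and SIM sample (2); population standard deviations unknown.
x1, s1, n1 = 105.8, 5.3, 10
x2, s2, n2 = 106.3, 5.8, 18

# S_big / S_small = 5.8 / 5.3 < 2, so the variances are pooled.
df = n1 + n2 - 2
s_p = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)

# Test statistic for H0: µK - µS = 0.
t_stat = (x1 - x2) / (s_p * sqrt(1 / n1 + 1 / n2))

t_crit = 2.056          # t table: right area 0.025, df = 26 (two-tailed, α = 0.05)
reject = abs(t_stat) > t_crit

print(df, round(s_p, 3), round(t_stat, 3))
print("Reject H0" if reject else "Do not reject H0")
```

Under these figures the test statistic is small in magnitude, so H0 is not rejected and the data do not support the claim that the mean IQ scores differ.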
Pooled t - test
https://www.youtube.com/watch?v=1F1Y3SuNz9c
The fourth test we are discussing is the Non-pooled t-Test (Stine, 2014). This is
used when the two samples are independent and the population standard
deviations are unknown but NOT equal. We will also require both populations to
be normally distributed.
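You may have observed that the test statistic for the Nonpooled t-Test is very
similar to that of the Independent Z-Test except that we have switched the population
variances to sample variances. The degrees of freedom for the t distribution are
rather complicated to calculate. When the hypothesised difference µ1 - µ2 is 0, the
formulas are:
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \qquad df = \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}{\frac{(S_1^2/n_1)^2}{n_1 - 1} + \frac{(S_2^2/n_2)^2}{n_2 - 1}} \]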
Example:
Test, at the 0.05 significance level, the claim that the mean IQ score of Kaplan
students is different from that of SIM students. A sample of 10 students from Kaplan
shows an average IQ score of 105.8 with a standard deviation of 7.3. Another
sample of 18 students was selected from SIM and the average IQ score was
106.3 with a standard deviation of 2.8. Assume that the IQ scores for Kaplan
and SIM students are normally distributed.
Step 1:
Claim: The mean IQ scores of Kaplan students is different from SIM students
(i.e. µK ≠ µS so µK - µS ≠ 0)
H0: µK - µS = 0
H1: µK - µS ≠ 0 (claim)
(two-tailed)
Step 2: α = 0.05
Step 3:
Independent samples
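\[ t = \frac{105.8 - 106.3}{\sqrt{\frac{7.3^2}{10} + \frac{2.8^2}{18}}} = -0.21 \]
\[ df = \frac{\left( \frac{7.3^2}{10} + \frac{2.8^2}{18} \right)^2}{\frac{(7.3^2/10)^2}{10 - 1} + \frac{(2.8^2/18)^2}{18 - 1}} = 10.49 \approx 10 \text{ (always round down)} \]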
Step 4:
Step 5:
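The nonpooled calculation, including the degrees of freedom, can be verified with this sketch. The figures are from the example; the critical value 2.228 is the t table value for a right area of 0.025 with df = 10, and the variable names are our own.

```python
from math import floor, sqrt

# Kaplan sample (1) and SIM sample (2); S_big / S_small = 7.3 / 2.8 > 2.
x1, s1, n1 = 105.8, 7.3, 10
x2, s2, n2 = 106.3, 2.8, 18

v1 = s1**2 / n1     # S1² / n1
v2 = s2**2 / n2     # S2² / n2

# Test statistic for H0: µK - µS = 0.
t_stat = (x1 - x2) / sqrt(v1 + v2)

# Degrees of freedom, always rounded down.
df_exact = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
df = floor(df_exact)

t_crit = 2.228      # t table: right area 0.025, df = 10 (two-tailed, α = 0.05)
reject = abs(t_stat) > t_crit

print(round(t_stat, 2), round(df_exact, 2), df)
print("Reject H0" if reject else "Do not reject H0")
```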
The Statistics exam scores of a group of Kaplan students were recorded and
grouped by gender as follows:
Sample Size 10 12
Mean 68.3 64.2
Standard Deviation 4.8 5.6
Summary
Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.
Paired t-Test
REFERENCES
Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.
Lind, D. A., Marchal, W. G., & Wathen, S. A. (2018). Statistical Techniques in
Business and Economics. USA: McGraw-Hill Education.
Stine, R. & Foster, D. (2014). Statistics for Business. USA: Pearson Education
Ltd.
So far we have been discussing the statistical analysis of one variable. In this
topic, we shall examine the relationship between two variables. In particular,
we are interested in the linear (or straight-line) relationship between two
variables.
This discussion can be generalised to multiple variables in more advanced
topics in Statistics, where the analysis is usually done with statistical software.
At this introductory level, we will discuss Regression Analysis and Correlation
Analysis for two variables.
Learning Outcomes
The following are the learning outcomes for this topic. At the end of the topic, do
a tally and ensure that you have achieved these outcomes:
We shall begin this topic with a quick review of the equation of a straight line
(Mathcentre, 2009).
y = mx + c
Example:
Volume/day (L)    Cost/day (S$)
23                125
26                140
29                146
33                160
38                167
42                170
55                195
60                200
[Figure: scatter plot "Cost per Day vs. Production Volume"; vertical axis: Cost per Day]
In this example, the independent variable (X) is volume and is placed on the
horizontal axis. The dependent variable (Y) is cost and is placed on the vertical
axis. We plot each pair of data as a dot on the diagram but do not connect the
points.
A scatter plot is a useful tool for identifying potential linear (straight-line)
associations between two variables. Occasionally, the scatter plot may not show
a potential linear relationship. In such situations, we may need a change of
variable, such as plotting Y against x², ln x, etc.
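As a small illustration of a change of variable, the sketch below uses made-up data that follow y = 3 + 2 ln x exactly. The data, the helper name `consecutive_slopes` and the choice of ln x are all assumptions for illustration and are not taken from this guide. If the points lie on a straight line, the slope between neighbouring points is the same everywhere, so the sketch compares those slopes before and after the transformation.

```python
import math

# Hypothetical data following y = 3 + 2 ln(x); not taken from the guide.
xs = [1, 2, 4, 8, 16, 32]
ys = [3 + 2 * math.log(x) for x in xs]

def consecutive_slopes(us, vs):
    """Slope of the segment joining each pair of neighbouring points."""
    points = list(zip(us, vs))
    return [(v2 - v1) / (u2 - u1)
            for (u1, v1), (u2, v2) in zip(points, points[1:])]

# Against x, the slopes keep shrinking, so the points bend away from a line.
raw_slopes = consecutive_slopes(xs, ys)

# Against ln(x), every slope is (approximately) 2, so the points lie on a line.
log_xs = [math.log(x) for x in xs]
log_slopes = consecutive_slopes(log_xs, ys)

print([round(s, 3) for s in raw_slopes])
print([round(s, 3) for s in log_slopes])
```

With real data the transformed slopes will not be identical, but a scatter plot of Y against the transformed variable should look noticeably straighter than the original.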
Scatter Plot
https://www.youtube.com/watch?v=NcgRa0uotXs
A Simple Linear Regression Model shows the straight-line relation between the
dependent variable (Y) and the independent variable (X) with some random
fluctuation around the line (Berenson, Levine & Szabat, 2015).
Suppose we are able to plot all the possible pairs of (X, Y) on the scatter plot
as shown:
This becomes a scatter plot for the population. If we are able to fit a straight line
through these dots, we can use this line to explain the linear regression model.
Note that $Y = \beta_0 + \beta_1 X$ is the equation of a straight line. The slope of
the line is $\beta_1$ and the y-intercept is $\beta_0$.
Note that not all the points fall on the straight line. If you look carefully at the
linear regression model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, it is actually the
equation of the straight line plus a term $\varepsilon_i$ (pronounced "epsilon"). This
term represents the random error, that is, the vertical distance between an
observed point and the line.
In actual practice, we only have a sample of pairs of (X, Y) data. Based on the
sample, we fit an Estimated Regression Line $\hat{Y} = b_0 + b_1 X$ to these
points.
Firstly, note that we use $b_0$ and $b_1$ instead of $\beta_0$ and $\beta_1$. This is
because we are using sample data to estimate the regression line. So, $b_0$ and
$b_1$ are estimates of $\beta_0$ and $\beta_1$ respectively. The symbol $\hat{Y}$
(pronounced "Y-hat") represents the estimated average value of Y.
We briefly described the principle of the least squares method on the previous
page. This method results in the above formulas, which are used to determine
the values of b0 and b1. The actual derivation of these formulas requires calculus
and is not within this syllabus. It suffices to be able to apply these formulas. We
will illustrate this with an example.
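For reference, the least squares estimates are commonly written in terms of the sums $S_{xx}$ and $S_{xy}$, which also appear in the correlation coefficient formula later in this topic. The summation definitions below are the standard ones and are stated here as an assumption about the notation; the expression for $b_0$ matches the one used in the worked example.

\[
S_{xx} = \sum x^2 - \frac{\left(\sum x\right)^2}{n},
\qquad
S_{xy} = \sum xy - \frac{\left(\sum x\right)\left(\sum y\right)}{n}
\]
\[
b_1 = \frac{S_{xy}}{S_{xx}},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
\]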
Example:
KFC needs to order oil daily for its fried chicken. In a particular KFC store, the
volume of oil used and the respective cost for nine days were recorded as
shown. Use the data provided to determine the regression equation.
\[
b_0 = \bar{y} - b_1 \bar{x} = 165.7 - 1.9 \times 39.6 \approx 90.5
\]
\[
\hat{Y} = b_0 + b_1 X = 90.5 + 1.9X
\]
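The arithmetic for the KFC example can be checked with a short script. This is a sketch only: the values of S_xy and S_xx are borrowed from the correlation example for the same KFC data later in this topic, and b1 = S_xy / S_xx is the standard least squares slope, assumed here to be the formula the guide intends. Variable names are illustrative.

```python
# Summary statistics for the KFC example.
x_bar, y_bar = 39.6, 165.7      # mean volume (L) and mean cost (S$)
s_xy, s_xx = 2662.7, 1386.2     # from the correlation example for the same data

# Slope: standard least squares form, rounded to 1 d.p. as in the guide.
b1 = round(s_xy / s_xx, 1)

# Intercept: b0 = y-bar - b1 * x-bar, also rounded to 1 d.p.
b0 = round(y_bar - b1 * x_bar, 1)

print(f"Y-hat = {b0} + {b1} X")   # Y-hat = 90.5 + 1.9 X
```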
b) Predict the average score of students who have studied for 25 hours.
1 10 32
2 13 35
3 15 40
4 19 49
5 22 53
6 25 58
7 31 64
8 35 70
9 39 74
10 43 81
TOTAL
We have learnt how to determine the values of b0 and b1 and write down the
linear regression equation. We shall now learn to interpret these values.
Example:
The slope is 1.9. This means that the estimated mean cost increases by $1.90
for every additional litre of oil used.
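To see this interpretation numerically, the sketch below evaluates the fitted line from the KFC example at two volumes that differ by one litre. The function name `predicted_cost` and the volumes 40 and 41 are chosen purely for illustration.

```python
def predicted_cost(volume_litres: float) -> float:
    """Estimated mean daily cost (S$) from the fitted line Y-hat = 90.5 + 1.9 X."""
    return 90.5 + 1.9 * volume_litres

cost_40 = predicted_cost(40)
cost_41 = predicted_cost(41)

# The difference equals the slope: one extra litre adds about S$1.90.
print(round(cost_41 - cost_40, 2))   # 1.9
```

The same difference is obtained for any pair of volumes one litre apart, which is exactly what a constant slope means.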
\[
r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}, \qquad -1 \le r \le 1
\]
Note that the correlation coefficient measures only the strength of the linear
relationship. Two variables may have a strong non-linear relationship even
though the correlation coefficient is close to zero.
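A quick way to see this is with made-up data in which y is determined exactly by x, but not linearly. In the sketch below, the data (y = x² on values of x symmetric about zero) and the helper name `correlation` are assumptions for illustration; the helper uses the usual definition of r as S_xy divided by the square root of S_xx times S_yy.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data: y depends perfectly on x, but along a curve (y = x^2).
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = correlation(xs, ys)
print(r)   # 0.0: no linear relationship, despite the exact quadratic one
```

This is why a scatter plot should be inspected alongside r: a value near zero rules out a linear pattern, not a relationship of any kind.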
The following examples illustrate that when the sample data are close to the
straight line, the value of r is nearer to -1 or 1. These are examples of strong
linear correlation.
When the data are scattered far from the straight line, the value of r will be
nearer to zero, which indicates a weak linear relationship.
Correlation Coefficient
https://www.youtube.com/watch?v=ugd4k3dC_8Y
Example:
Using the same KFC example discussed earlier, determine and interpret the
correlation coefficient.
\[
r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{2662.7}{\sqrt{1386.2 \times 5290}} = 0.98
\]
This means that there is a strong positive linear relationship between the cost
and the volume of oil used.
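The same calculation can be reproduced from the summary values quoted in this example. This is a minimal sketch; the variable names are illustrative.

```python
import math

# Summary values for the KFC example.
s_xy = 2662.7
s_xx = 1386.2
s_yy = 5290

# Correlation coefficient: r = S_xy / sqrt(S_xx * S_yy).
r = s_xy / math.sqrt(s_xx * s_yy)

print(round(r, 2))   # 0.98
```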
Continuing from the previous Class Activity on Statistics Exam Scores, determine
and interpret the correlation coefficient.
Summary
Can you recall what you have learned in this topic? For each sub-topic listed
below, try to provide some pointers to consolidate your learning.
Scatter Plot
Interpretation of Slope
Correlation Coefficient
REFERENCES
Berenson, M., Levine, D., & Szabat, K. (2015). Basic Business Statistics –
Concepts and Applications. Australia: Pearson Education Ltd.
Chartio. (2018). What is a Scatter Plot and When to Use It. Retrieved from
https://chartio.com/learn/dashboards-and-charts/what-is-a-scatter-plot/