Business Statistics Book - OCR
Institute of Management Technology
IMT | Centre for Distance Learning, Ghaziabad
VISION
Imparting continuum of management education through distance mode to learners across the globe.
MISSION
Be an academic community leveraging technology as a bridge to innovation and life-long learning.
To continuously evolve management competencies for enhanced employability and entrepreneurship.
To serve society through excellence and leadership in management education, research and consultancy.
OPMC001
Business Statistics
INDEX
UNIT 1
Defining and Collecting Data
UNIT 2
Organizing and Visualizing Variables
UNIT 3
Numerical Descriptive Measures
UNIT 4
Basic Probability
UNIT 5
Discrete Probability Distribution
UNIT 6
The Normal Distribution and Other Continuous Distributions
UNIT 7
Sampling Distributions
UNIT 8
Fundamentals of Hypothesis Testing: One-Sample Tests
UNIT 9
Two-Sample Test
UNIT 10
Analysis of Variance
UNIT 11
Chi-Square Test
UNIT 12
Simple Linear Regression
UNIT 13
Simulation
UNIT 14
Index Number
ACKNOWLEDGEMENT
Publisher Address: A-16, Site 3, UPSIDC Industrial Area, Meerut Road, Ghaziabad
Printed by: Utility Forms Pvt. Ltd., A-23/B-1, Mohan Cooperative Industrial Estate, Mathura Road,
New Delhi- 110044; Phone No: 011-46757575: E-mail: [email protected]
ISBN: 978-81-951960-6-7
All rights reserved. No part of this work may be reproduced in any form, by mimeography or any other means,
without permission in writing from Institute of Management Technology, Centre for Distance Learning.
Further information on Institute of Management Technology, Centre for Distance Learning may be obtained
from Institute’s Head Office at Ghaziabad or www.imtcdl.ac.in
STRUCTURE
1.0 Objectives
1.1 Introduction
1.2 Defining Variables
1.3 Measurement Scales
1.4 Collecting Data
1.5 The Methods of Data Collection
1.6 Different Ways of Collecting Samples
1.7 Let Us Sum Up
1.8 Key Words
1.9 References and Suggested Additional Readings
1.10 Self-Assessment Questions
1.11 Answers to Self-Assessment Questions
1.12 Check Your Progress — Possible Answers
1.0 OBJECTIVES
After reading this unit, you will be able to:
define variables
understand different measurement scales
1.1 INTRODUCTION
The business world of today is so intimately aligned with the universe of data and statistics
that it is impossible for a business or management professional to think of operations
disentangled from data processes and systems. Jim Gray, winner of the prestigious Turing
Award, imagined the world to be data-driven. In this scenario, we can ignore data only at our
peril as it has become increasingly evident that collecting facts and figures and defining
variables for the business are crucial processes. This Unit discusses the meaning of data,
associated variables, scales of statistical measurements, and related processes. It takes you
on a guided tour into the world of data and statistics and explains the essential features of
their utility and application for management professionals.
As an initial exercise to understand the properties of Data and Statistics, you may attempt to
answer the following two questions:
You are the Sales Manager in charge of the best-selling washing machine in its category.
For years, your chief competitor has been making incremental sales gains, claiming a
better washing machine. Worse, a new sibling product from your company, known for
its good quality, has rapidly gained significant market share at the expense of your
product. Worried that your product may soon lose its number one status, you seek to
improve sales of the product by improving its after-sales service. You experiment and
develop a new after-sales service process. You conduct surveys and discover that
people overwhelmingly like the new process, and you decide to adopt it
since statistical evidence has shown that people prefer it.
What could go wrong?
You may now realize that much did go wrong. The above case tells us that if we choose the
wrong variables, we may not end up with results that support making better decisions.
As an initial note, Statistics is a way of thinking that can help fact-based decision-making.
Statistics is the branch of mathematics that transforms numbers into useful information for
decision-makers. Statistics provides a way of understanding and then reducing, but not
eliminating, the variation that is part of any decision-making process. Statistical data also tell
us the known risks associated with decision-making. Statistics achieves this by providing a
set of methods for analyzing numbers. These methods help us to find patterns in numbers
and enable us to determine whether differences in the numbers are merely due to chance.
As we progress in this unit, we will learn these methods and will also learn the appropriate
conditions for using them.
• Variables are numbers, amounts, or situations that can change, that are not rigid
or static, and have the propensity to transform, modify, or shift from their initial
locus. The idea of a variable originated from the work of the French
mathematician François Viète towards the end of the sixteenth century. He
brought into practice the method of representing known or even unknown
numbers by letters and practicing computation with them as if they were
numerical entities.
Categorical variables are also known as qualitative variables and take values that
can be placed into categories, such as 'yes' or 'no'.
“Do you currently own stocks and bonds?” and “Did you buy any of the shoes advertised
in the flyer with today's newspaper?” are examples of categorical variables, both of
which have 'yes' or 'no' as their values. Categorical variables can have more than
two possible responses, for example when customers are asked to indicate the day of the
week on which they made their purchases.
Question. Gender, with its categories male and female, is an example of a ________.
Numerical variables are also known as quantitative variables and carry values that represent
quantities. For example, the response to the question “How much money do you
expect to spend on a stereo?” is a numerical variable.
Numerical variables are further subdivided as discrete or continuous variables.
Discrete variables have numerical values that arise from a counting process. The
number of magazines subscribed to is an example of a discrete numerical variable
because the response is one of a finite number of integers. You subscribe to
zero, one, two, or more magazines. The number of items that a customer purchases
is also a discrete numerical variable because we are counting the number of items
purchased.
Continuous variables produce numerical responses that arise from a measuring
process. The time you wait before an ATM generates the requested cash, the
time spent waiting for a web page to load, and the temperature of a body are
examples of continuous numerical variables because the responses take on any
value within a continuum or interval, depending on the precision of the
measuring instrument. For example, your waiting time could be 1 minute, 1.1
minutes, 1.11 minutes, or 1.113 minutes, depending on the accuracy of the
measuring device you use. The following simple question will further
clarify the nature and meaning of numerical variables:
o Are you counting or measuring?
In Table 1.1 we present examples of measurement scales, some of which are used
in the remaining parts of this section. Numerical variables use
either an interval scale, which expresses a difference between measurements
that do not include a true zero point, or a ratio scale, an ordered scale that
includes a true zero point.
If a numerical variable has a ratio scale, we can characterize one value in terms of
another. For example, an item that costs Rs 2 is twice as expensive as an item
that costs Re 1.
For both interval and ratio scales, what the difference of 1 unit represents
remains the same among pairs of values, so that the difference between Rs 11
and Rs 10 represents the same difference as the difference between Rs 2 and Re 1
(and the difference between 11°C and 10°C represents the same as the difference
between 2°C and 1°C).
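The distinction between the two scales can be checked with a little arithmetic. The sketch below is only an illustration: the temperatures of 20°C and 10°C and the helper `celsius_to_kelvin` are assumptions introduced here, not values from this unit. It shows that a ratio is meaningful for prices (true zero) but misleading for Celsius temperatures (no true zero), while differences behave the same way on both scales.

```python
# Ratio scale: a price of zero means "no money", so ratios are meaningful.
price_a = 2.0  # Rs
price_b = 1.0  # Re
price_ratio = price_a / price_b  # the Rs 2 item is twice as expensive

# Interval scale: 0 degrees Celsius is an arbitrary point, not "no temperature".
def celsius_to_kelvin(celsius):
    """Convert to Kelvin, a temperature scale that does have a true zero."""
    return celsius + 273.15

hot_c, cold_c = 20.0, 10.0          # illustrative values (assumption)
naive_ratio = hot_c / cold_c        # 2.0, but 20 C is NOT "twice as hot" as 10 C
kelvin_ratio = celsius_to_kelvin(hot_c) / celsius_to_kelvin(cold_c)

# Differences of one unit mean the same thing on both scales.
same_difference = (11.0 - 10.0) == (2.0 - 1.0)

print(f"Price ratio: {price_ratio}")
print(f"Naive Celsius ratio: {naive_ratio}, ratio on the Kelvin scale: {kelvin_ratio:.3f}")
print(f"Equal one-unit differences: {same_difference}")
```

The ratio computed on the Kelvin scale is only slightly above 1, which is why "twice as hot" cannot be read off an interval scale.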
Categorical variables use measurement scales that provide less insight into the
values for the variable. For data measured on a nominal scale, category values
express no order or ranking. For data measured on an ordinal scale, ordering or
ranking of category values is implied. Ordinal scales give us information to
compare values, but not as much as interval or ratio scales. For example, the
ordinal scale poor, fair, good, and excellent tells us that
“good” is better than “poor” or “fair” but not better than “excellent”. Unlike
interval and ratio scales, however, we do not know whether the difference from “poor” to
“fair” is of the same magnitude as that from “fair” to “good” or from “good” to “excellent”.
The numbers or the data collected are meaningless unless all variables have operational
definitions. These definitions have universally accepted significance, clear to everyone
associated with the process of analysis. Even though the operational definition for sales per
year might seem clear, miscommunication could occur if one person was referring to sales
per year for the entire chain of stores and another to sales per year per store, or if one person
measures a quarter in 4-4-5 week periods and another measures to the end of each month. It is
imperative to familiarize oneself with the basic vocabulary of statistics as discussed below:
Variables:
Population (N):
The symbol “N” is used to represent the population. Four other basic vocabulary terms are
population, sample, parameter, and statistic. A population consists of all the items or
individuals about which we wish to acquire information. Data of the sales transactions of an
electronics goods store, the number of customers who shopped at a city mall during a
specific weekend, the number of students who enrolled for a part-time course at a particular
college in a particular year, the registered voters of a town called Ramgarh, are examples of a
population.
Sample (n):
The symbol “n” denotes a sample. A sample is the portion of a population selected for analysis.
Therefore, if we say that 200 sales transactions of the electronics goods store are randomly
selected by an auditor for study, or 30 shoppers at the mall are asked to complete a customer
satisfaction survey, or 50 part-time students are selected for a marketing study, or 500
registered voters in Ramgarh are asked whom they voted for, we are referring to a sample. In
each sample, the transactions or people in the sample represent a portion of the items or
individuals that make up the population.
Parameter:
A parameter is a numerical measure that describes a characteristic of a population. The mean
amount spent by all customers who shopped at the city mall during the weekend is an
example of a parameter because the amount spent by the entire population is what we are
looking for. In contrast, the mean amount spent by the 30 customers completing the
customer satisfaction survey is an example of a statistic because it is computed from the
amounts spent by only a sample of 30 people.
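The difference between a parameter and a statistic can be illustrated with a minimal sketch. The population below is simulated: the 1,000 shoppers and their spending amounts are assumptions made up for the illustration, not data from the mall example. The sketch computes the population mean (a parameter) and the mean of a random sample of 30 shoppers (a statistic).

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: amount spent (Rs) by every shopper during the weekend.
population = [random.randint(100, 5000) for _ in range(1000)]
N = len(population)            # population size

# Parameter: a numerical measure computed from the entire population.
parameter = mean(population)

# Statistic: the same measure computed from only a sample of n = 30 shoppers.
sample = random.sample(population, 30)
n = len(sample)                # sample size
statistic = mean(sample)

print(f"N = {N}, population mean (parameter) = {parameter:.2f}")
print(f"n = {n}, sample mean (statistic)     = {statistic:.2f}")
```

Drawing a different sample would usually give a different statistic, while the parameter stays fixed for a given population.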
1.4.2 DATA COLLECTION
• Capturing revenue and profit figures from the balance sheet of a business
organization.
Market research firms and trade associations distribute data on specific industries or
markets. Investment services provide financial data on a company-by-company basis.
Syndicated services such as AC Nielsen provide clients with data that enables the
comparison of the market share of client products with those of their competitors. Daily
newspapers are filled with numerical information regarding stock prices, weather
conditions, and sports statistics.
Designed Experiment:
Outcomes of a designed experiment are another data source. These outcomes are the result
of an experiment, such as a test of several laundry detergents to compare how well each
detergent removes a certain type of stain. Developing proper experimental designs is a
subject beyond the scope of this study material because such designs often involve
sophisticated statistical procedures.
• Census and Sampling Methods:
Primary data become necessary whenever secondary data are not
available. Primary data can be obtained either by the census method or by
the sampling method.
• Census Method
When the researcher collects data from every member of the population (N), it is
referred to as the census method or complete enumeration method. For
example, a census covering every individual is conducted every ten years in India.
Advantages
• Information regarding each member of the population can be obtained. The
information collected is more accurate.
Disadvantages
• It requires a lot of time and a huge amount of money, and can easily result in
duplication, exclusion, etc. because of the enormous effort involved.
• Sampling Method
When the researcher collects data from only a few individuals in the population,
it is referred to as the sampling method. These individuals are known as the
respondents and constitute the sample (n). The process of collecting the primary data from the
sample is known as a survey. For example, AC Nielsen conducts surveys on market shares.
While “Google Forms” surveys are typically sent and answered via email, we can
also get respondents to fill in answers on a web page, embed the questionnaire
on a site, and share it via social media. Here are our step-by-step instructions for
how to create a survey with Google Forms.
Navigate to https://docs.google.com/forms/ and click Blank. Google Forms has several pre-
made templates to choose from, and we can view them all by clicking More.
Name your survey. You can also add a description. If you wish to name the Google
Form for your reference, click untitled form in the top left corner and edit.
Tap on untitled question and compose a question.
• Multiple choice lets users select one answer from a series of options, while
Checkboxes allows users to select multiple answers.
• Dropdown provides recipients a field to click that reveals a menu from which they
select an answer.
• Linear scale allows users to answer by selecting a rating from a range such as 1
to 5.
• Date and Time allow recipients to select a date or time.
Click the Required switch to make a question mandatory.
Now that your survey is sent, your audience is expected to answer. To view what
your recipients said, click on Responses.
In probability sampling, every element has a known, non-zero probability of being included
in the sample. In simple random sampling, each element of the population has an equal chance
of being selected. Types of probabilistic sampling are:
• Systematic Sampling: A simple process in which every nth name from the list is
drawn.
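The two selection rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame of 500 voter labels, the function names, and the use of a random starting point for the systematic sample are all introduced here and are not part of the text. The sampling interval is called `k` in the code to avoid confusion with the sample size n.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n items so that every item has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def systematic_sample(frame, n, seed=None):
    """Pick a random start, then take every k-th item from the list."""
    k = len(frame) // n              # sampling interval
    if k < 1:
        raise ValueError("sample size cannot exceed the size of the frame")
    rng = random.Random(seed)
    start = rng.randrange(k)         # random start within the first interval
    return [frame[start + i * k] for i in range(n)]

# Hypothetical frame: a numbered list of 500 registered voters.
frame = [f"Voter {i:03d}" for i in range(1, 501)]

srs = simple_random_sample(frame, 50, seed=1)
sys_sample = systematic_sample(frame, 50, seed=1)

print(srs[:5])
print(sys_sample[:5])
```

With 500 names and a sample of 50, the interval is 10, so the systematic sample takes every tenth name after the random start.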
Exercise on MS Excel:
Simple Random Sample Key Technique:
Exercise:
Type of variables and examples
1.7 LET US SUM UP
In this unit, we learned the details of defining and collecting data and the types of variables.
Statistics is the science of collecting, analyzing, presenting, and interpreting data. Data
consist of facts and figures. We learned the methods of data collection and specifically
learned the vocabulary of population, sample, and sampling methods.
1.8 KEY WORDS
Data: The facts and figures collected, analyzed and summarized for presentation and
interpretation.
Nominal scale: The scale of measurement for a variable when the data are labels or names
used to identify an attribute of an element. Nominal data may be numeric or non-numeric.
Ordinal scale: The scale of measurement for a variable if the data exhibit the properties of
nominal data and the order or rank of the data is meaningful.
Levine, David M., Stephan, David F., Statistics for Managers Using Microsoft Excel, 8th edition,
Pearson Education.
The oranges grown in corporate farms in an agricultural state were damaged by some
unknown fungi a few years ago. Suppose the manager of a large farm wanted to study the
impact of the fungi on the orange crops daily over 6 weeks. On each day, a random sample of
orange trees was selected from within a random sample of acres. The daily average number
of damaged oranges per tree and the proportion of trees having damaged oranges were
calculated. The two main measures calculated each day (i.e., the average number of
damaged oranges per tree and the proportion of trees having damaged oranges) are called
and
Scenario 1: A Times of India poll asked 2,150 adults in India a series of questions to find out
their views on the Indian economy.
c) the primary data source
d) the secondary data source
Q.3 Referring to Scenario 1, the possible responses to the question "How satisfied are you
with the Indian economy today with 1 = very satisfied, 2 = moderately satisfied, 3 =
neutral, 4 = moderately dissatisfied and 5 = very dissatisfied?" are values from a
a) discrete variable
b) continuous variable
c) ordinal variable
Q.5 Referring to Scenario 1, the possible responses to the question "How would you rate
the condition of Indian economy with 1 = excellent, 2 = good, 3 = decent, 4 = poor, 5 =
terrible?" result in
a) a nominal scale variable
b) an ordinal scale variable
Q.6 Referring to Scenario 1, the possible responses to the question "Are you 1. Currently
employed, 2. Unemployed but actively looking for a job, 3. Unemployed and quit
looking for a job?" result in:
a) a nominal scale variable.
b) an ordinal scale variable.
Q.7 Referring to Scenario 1, the possible responses to the question "In which year do you
think the last recession in India started?" result in
a) a nominal scale variable.
d) a ratio scale variable.
Q.8 Referring to Scenario 1, the possible responses to the question "On the scale of 1 to
100 with 1 being extremely anxious and 100 being not anxious, rate your level of
anxiety in this Indian economy" result in:
a) a nominal scale variable.
b) a population.
c) a primary data source
d) a secondary data source
Q.10 The portion of the universe that has been selected for analysis is called
a) a sample
b) a frame
c) a primary data source
ANSWER: statistics
ANSWER: parameters
Q.5 b
Q.6 a
Q.7 c
Q.8 b
Q.9 b
Q.10 a
ORGANIZING AND VISUALIZING
VARIABLES
STRUCTURE
2.0 Objectives
2.1 Introduction
2.2 Organizing Categorical Variables
2.3 Organizing Numerical Variables
2.4 Visualization of Categorical Data
2.5 Visualizing Numerical Variables
2.6 Visualizing Two Numerical Variables
2.7 Organizing and Visualizing a Mix of Variables
2.8 The Challenge in Organizing and Visualizing Variables
2.9 Let Us Sum Up
2.10 Key Words
2.11 Self-Assessment Questions
2.12 Answers to Self-Assessment Questions
2.13 Check Your Progress—Possible Answers
2.0 OBJECTIVES
2.1 INTRODUCTION
In this unit, we will learn about the systematic organization and visualization of data.
Basically, facts and figures are called data. In statistics, a comprehensive process
covers the collection of data and their organization in an understandable and readable form.
This unit introduces commonly used tabular and graphical methods for summarizing both
categorical and numerical variables. Tabular and graphical summaries of data are found in
annual reports, newspaper articles, business magazines, reference books, etc. We are all
exposed to different types of variables and their measurements. In the previous unit, the
term “statistics” was defined as the collection, organization, analysis, interpretation, and
presentation of data. We begin with tabular and graphical methods for summarizing data
concerning a single variable. Section 2.7 introduces the methods of summarizing data for
a mix of variables.
Field work:
Take a round in your colony or your IMTCDL campus. Find out how many types of trees you
can see there. Do you know their names? You can make drawings. Use tally marks to note the
number of different trees.
A variable is a symbol (e.g., X or Y) that represents any of a specified set of values.
For example, suppose variable X represents the percentage of defective units in a shipment
of widgets. Since X is a percentage, the variable
X could take on any value between 0 and 100.
Variables can be classified as categorical (aka, qualitative) or quantitative (aka, numerical).
Categorical variables take on values that are names or labels. The colour of a ball (e.g., red,
green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be examples of
categorical variables.
A tabular summary of data showing the number (or frequency) of items in each of several
non-overlapping classes is called a frequency distribution.
Let us look at the raw data from the unit2_workbook_ex1 as:
To develop a frequency distribution for these data, we count the number of times each soft
drink appears in Table 2.1 and summarize the counts in the frequency distribution table, Table
2.2. The count for each type of drink is called its frequency, and the
table in which a column contains the frequencies is called the frequency distribution.
The frequency distribution provides a summary of how the 50 soft drinks were purchased by
Ram. This distribution shows the consumption pattern.
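The counting step can also be sketched in Python. The ten purchases below are hypothetical stand-ins, since the raw data of Table 2.1 are not reproduced here; the drink names are those used in this unit. `collections.Counter` is a standard-library tool that tallies how many times each label occurs.

```python
from collections import Counter

# Hypothetical raw data: one entry per soft drink purchased (not Table 2.1).
purchases = [
    "Coke Classic", "Diet Coke", "Pepsi", "Diet Coke", "Coke Classic",
    "Coke Classic", "Dr. Pepper", "Diet Coke", "Pepsi", "Sprite",
]

# Frequency distribution: the number of times each drink appears.
frequency = Counter(purchases)

print(f"{'Soft drink':<15}{'Frequency':>10}")
for drink, count in frequency.most_common():
    print(f"{drink:<15}{count:>10}")
print(f"{'Total':<15}{sum(frequency.values()):>10}")
```

The total of the frequency column always equals the number of observations, which is a quick check that no purchase was missed or double-counted.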
How to make a frequency distribution table using Excel:
Step 3: Enter
=COUNTIF($B$5:$D$21, H5)
Step 4: Copy cell I5 to cells I6 to I9.
A relative frequency is the fraction or proportion of items belonging to a class. For a data
set with a total of n observations, the relative frequency is calculated as:
Relative frequency of a class = Frequency of the class / n
Now, let us try to calculate the relative frequency and the percent frequency for the frequency
distribution of Table 2.2.
We divide the frequency of each class by the total frequency to get the relative frequency,
and multiply the relative frequency by 100 to get the percent frequency. The table in which a
column contains the relative frequencies is known as the relative frequency distribution table.
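As a minimal sketch of this calculation, the code below divides each class frequency by the total number of observations. The frequencies are hypothetical (they are not the values of Table 2.2); only the drink names come from this unit.

```python
# Hypothetical frequency distribution (illustration only, not Table 2.2).
frequency = {
    "Coke Classic": 3,
    "Diet Coke": 3,
    "Pepsi": 2,
    "Dr. Pepper": 1,
    "Sprite": 1,
}

n = sum(frequency.values())  # total number of observations

# Relative frequency = class frequency / n; percent frequency = 100 * frequency / n.
relative = {drink: count / n for drink, count in frequency.items()}
percent = {drink: 100 * count / n for drink, count in frequency.items()}

print(f"{'Soft drink':<15}{'Freq':>6}{'Relative':>10}{'Percent':>9}")
for drink, count in frequency.items():
    print(f"{drink:<15}{count:>6}{relative[drink]:>10.2f}{percent[drink]:>9.1f}")
```

The relative frequencies add up to 1 and the percent frequencies add up to 100, which gives a simple way to verify the table.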
Quantitative variables are numerical. They represent a measurable quantity. For example,
when we speak of the population of a city, we are talking about the number of people in the
city - a measurable attribute of the city. Therefore, population would be a quantitative
variable.
Some examples will clarify the difference between discrete and continuous variables.
Suppose the fire department mandates that all fire fighters must weigh between 150 and
250 pounds. The weight of a fire fighter would be an example of a continuous variable, since
a fire fighter's weight could theoretically take on any value between 150 and 250 pounds,
even though it may not practically encompass all values.
Suppose we flip a coin “n” number of times and count the number of heads. The number of
heads could be any integer value between 0 and plus infinity. However, it cannot be just any
number between 0 and plus infinity; for example, it cannot be 2.5 heads. Therefore, the number of heads
must be a discrete variable.
In simple words, data that come from measurement are called continuous
variables, and data that come from counting are called discrete variables.
Frooti, a very popular product of Parle AGRO, regularly received complaints through the CRM that the
Frooti 160 ml pack frequently contains less than 160 ml. As the Business Manager of the
company, we have to collect random samples from different retailers. Suppose we collected
20 samples of tetra pack Frooti from the market, measured the volume of each pack, and
noted them as follows:
The three steps necessary to define the classes for a frequency distribution with
quantitative data are:
1. Determine the number of non-overlapping classes.
In our case, we used the MAX function of Excel to find the largest data value and the MIN
function of Excel to find the smallest data value; the number of classes is already decided
as 5.
To develop the table, we have to include the smallest value in the first class, so we start
at 146 and add 4 to reach 150 (the class is inclusive of both numbers).
So, our first class is 146-150, the second class is 151-155, and so on.
The class limits must be chosen so that each data item belongs to one and only one class. In
the case of qualitative data there is no chance of overlapping, so a class limit
decision is not required for categorical data.
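The class construction described above can be sketched as follows. The 20 volume readings are hypothetical, since the actual Frooti measurements are not reproduced here; they are assumed to be whole millilitres so that inclusive classes such as 146-150 and 151-155 do not overlap. The width rule used, (largest value - smallest value) / number of classes rounded up, is a common rule of thumb and is an assumption of this sketch.

```python
import math

# Hypothetical volumes (ml) of 20 packs, recorded as whole numbers.
volumes = [146, 157, 158, 159, 160, 156, 158, 160, 157, 159,
           161, 162, 163, 164, 165, 162, 166, 168, 170, 161]

num_classes = 5
smallest, largest = min(volumes), max(volumes)   # MIN and MAX in Excel

# Approximate class width, rounded up to a whole number.
width = math.ceil((largest - smallest) / num_classes)

# Inclusive class limits: 146-150, 151-155, and so on.
classes = [(smallest + i * width, smallest + (i + 1) * width - 1)
           for i in range(num_classes)]

# Each value falls in exactly one class because the limits do not overlap.
frequency = {f"{low}-{high}": sum(low <= v <= high for v in volumes)
             for low, high in classes}

for label, count in frequency.items():
    print(f"{label:<10}{count:>4}")
```

Because every reading falls into exactly one class, the class frequencies add up to the 20 observations.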
Table: Frequency Distribution of Frooti Tetra Pack Data (Excel worksheet showing the largest data value, the smallest data value, the class intervals, and their frequencies)
Exercise:
Calculate the relative frequency and percent frequency of the data set in the table below:
Table 2.7: Frequency Distribution of Frooti Tetra Pack Data
The following are used to visualize categorical data:
1. Bar Graphs
2. Pie Charts
3. Pareto Chart
Bar Graphs and Pie Charts:
A bar graph is a graphical device for summarizing a frequency distribution in the form of
either horizontal or vertical bars. We place the different classes along the x-axis
and the frequency along the y-axis. For easy reference, the bar chart for the above
illustration is provided in Fig. 2.1.
Fig. 2.1: Bar Chart for the Categorical Data
For example, in the above table, the angle for Pepsi is calculated as (8/50) × 360 = 57.6
degrees.
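The same arithmetic can be written as a small function. The Pepsi figures (8 purchases out of 50) are taken from the example above; the function name `sector_angle` is introduced here only for illustration.

```python
def sector_angle(frequency, total):
    """Angle (in degrees) of the pie-chart sector for one class."""
    return 360 * frequency / total

# Pepsi: 8 purchases out of 50, as in the example above.
pepsi_angle = sector_angle(8, 50)
print(f"Pepsi sector: {pepsi_angle:.1f} degrees")

# A class covering the whole data set would take the full circle.
full_circle = sector_angle(50, 50)
print(f"Whole data set: {full_circle:.0f} degrees")
```

Since the class frequencies add up to the total, the sector angles of all classes together always add up to 360 degrees.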
The same data represented as a pie chart, from the Excel worksheet
unit2_workbook_ex1, are shown in Fig. 2.2.
Fig. 2.2: Pie Chart
The pie chart gives us more interesting results, since we can easily form an idea of the
proportions just by looking at the chart, e.g. 26% for Diet Coke. In the pie chart, we first draw
a circle to represent all the data. Then we use relative frequencies to subdivide the circle into
sectors, or portions, that correspond to the relative frequency of each class. Then we convert
each relative frequency into its corresponding angle by using the formula
Angle of a sector = Relative frequency of the class × 360 degrees
Pareto Chart
The frequency of each category is plotted as a vertical bar in descending order of
frequency and combined with a cumulative percentage line on the same chart, as shown in
Fig. 2.3.
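The two ingredients of a Pareto chart, the descending order of the bars and the cumulative percentage line, can be computed as in the sketch below. The frequencies are hypothetical and serve only to illustrate the calculation; the chart itself would then be drawn in Excel or any plotting tool.

```python
from itertools import accumulate

# Hypothetical frequencies (illustration only).
frequency = {"Sprite": 1, "Coke Classic": 4, "Pepsi": 2, "Diet Coke": 3}

# Step 1: order the categories by descending frequency (the bars).
ordered = sorted(frequency.items(), key=lambda item: item[1], reverse=True)

# Step 2: running total of the ordered frequencies, as a percentage (the line).
total = sum(frequency.values())
running_totals = list(accumulate(count for _, count in ordered))
cumulative_percent = [100 * value / total for value in running_totals]

print(f"{'Category':<15}{'Freq':>6}{'Cum. %':>9}")
for (category, count), cum in zip(ordered, cumulative_percent):
    print(f"{category:<15}{count:>6}{cum:>9.1f}")
```

The cumulative percentage line always ends at 100, and the first few categories show how much of the total is explained by the largest classes.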
Fig. 2.3: Pareto Chart (categories in descending order of frequency: Diet Coke, Dr. Pepper, Coke Classic, Sprite, Pepsi)
2. Histogram
3. Cumulative Distributions/Ogive
We are using the Frooti case (Unit2_workbook_ex2) for developing the dot plot and scatter
diagram. In Excel, the following steps must be followed:
Scatter Plot
Histogram
A histogram is a commonly used graphical representation of quantitative/numerical data. A
histogram is constructed by placing the variable of interest on the horizontal axis and the
frequency, relative frequency, or percent frequency on the vertical axis, as shown
in Fig. 2.5.
The most important use of a histogram is to provide information about the shape of the data.
If the data set is tilted toward the right, we call it right-skewed. The best data set has the
symmetrical shape of a histogram, or normal curve. A histogram gives us a picture
of the data. It also helps in deciding whether the distribution is normal or not.
Fig. 2.5: Histogram (frequency on the vertical axis; classes 146-151, 151-156, 156-161, 161-166, 166-171 on the horizontal axis)
A histogram for the Frooti data can be made using Excel as follows:
Step 1: Select the class data set.
Step 2: Press the CTRL key and select the frequency data set.
Step 3: Click the Insert tab on the Ribbon.
Step 4: In the chart group, click Column/2D column, then chart layout 8.
Step 5: Select the bars, right-click, and choose Format Data Series.
Step 6: On the Format Data Series pane, set the Gap Width to zero.
The Cumulative Percentage Polygon
This visual presentation is also known as an ogive. It is like a scatter diagram in which
the midpoint of the class is on the x-axis and the percent cumulative frequency is on the y-axis.
Cumulative frequencies are of two types: the less-than cumulative frequency and
the more-than cumulative frequency.
In the table above, we can see at a glance how many packs weigh less
than 151: it is 1. In the next column, we find how many packs weigh less than 156:
it is 1 again. Similarly, the number weighing less than 161 is 10, and so on. The
next column contains the number of packs weighing more than 146, and so on.
We can plot the CF (cumulative frequency) distribution chart or even the percent cumulative
distribution, as shown in Fig. 2.6.
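Both kinds of cumulative frequency can be computed from the class frequencies with a running total. In the sketch below, the first three classes are chosen to agree with the counts quoted above (1 pack below 151, 1 below 156, 10 below 161); the frequencies of the last two classes are hypothetical, and the more-than column is interpreted here as the number of packs at or above each lower class boundary.

```python
from itertools import accumulate

# Class boundaries as on the histogram axis: 146-151, 151-156, and so on.
lower = [146, 151, 156, 161, 166]
upper = [151, 156, 161, 166, 171]
freq = [1, 0, 9, 7, 3]   # last two values are hypothetical (illustration only)

total = sum(freq)

# Less-than cumulative frequency: packs below each upper boundary.
less_than = list(accumulate(freq))

# More-than cumulative frequency: packs at or above each lower boundary.
more_than = [total - below for below in [0] + less_than[:-1]]

# Percent less-than cumulative frequency, the series plotted in an ogive.
percent_less_than = [100 * value / total for value in less_than]

for lo, hi, lt, mt, pct in zip(lower, upper, less_than, more_than, percent_less_than):
    print(f"less than {hi}: {lt:>2} ({pct:5.1f}%)   more than {lo}: {mt:>2}")
```

The less-than series rises from the first class to the total, while the more-than series falls from the total towards zero.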
Fig. 2.6: Cumulative Frequency Distribution Chart
Suppose that we want to introduce alternative irrigation equipment in Rajasthan. For that
purpose, we collected secondary data on yearly rainfall from the meteorological
department database and annual paddy production data from the agriculture ministry site
as:
Ex-3
Year   Rainfall   Paddy production
1      123        678
2      134        654
3      156        698
4      167        643
5      134        690
Fig. 2.7: Scatter Plot of Rainfall (x-axis) and Paddy Production (y-axis)
Kindly refer to Unit2_workbook_ex3. We took the rainfall data along the x-axis and the paddy
production data along the y-axis; Fig. 2.7 is the scatter plot of the two variables.
From the shape of the graph, the points show no clear linear pattern, which suggests little or no correlation between rainfall and paddy production.
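The visual impression from a scatter plot can be backed up by the Pearson correlation coefficient, which ranges from -1 to +1, with values near 0 indicating a weak linear relationship. The coefficient is not introduced in this unit, so the sketch below is an added illustration; it uses only the five rows listed under Ex-3 above, and the function name `pearson_r` is an assumption of the sketch.

```python
from math import sqrt

# The five rows listed under Ex-3 (rainfall, paddy production).
rainfall = [123, 134, 156, 167, 134]
paddy = [678, 654, 698, 643, 690]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = pearson_r(rainfall, paddy)
print(f"Correlation between rainfall and paddy production: r = {r:.2f}")
```

For these five rows the coefficient is about -0.28, a weak relationship, which is in line with the scatter plot showing no clear pattern.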
In Excel, we use a Pivot Table for this purpose. A detailed tutorial is attached as an endnote.
Information overload, that is, presenting too many details, can hamper our decision-making. The
overuse of data confuses decision-makers and is known as obscuring data.
Some people add decorative elements to enhance or replace the simplicity of charts and
graphs; these elements create a false impression and are known as chartjunk.
2.9 LET US SUM UP
We sum up the discussion with the help of a flow diagram (which is also a type of
visualization tool):
Data
|                    |
Categorical Data     Numerical Data
Cumulative Frequency Distribution
This chapter introduces the role of statistics in turning data into information. Businesses use
statistics to summarize and draw conclusions from data, to make reliable forecasts, and to
improve business processes. The chapter discusses data collection and the various types of
data used in business. We also learned how to draw tables and charts that are appropriate
for categorical and numerical variables and to draw conclusions from them. Pie charts,
histograms, and other graphical methods that enable decision-making were discussed.
2.10 KEYWORDS
Frequency distribution: A tabular summary of data showing the number of data values in
each of several non-overlapping classes.
Other key words in this unit are bar chart, pie chart, and relative frequency.
Q.1 In a survey, 150 executives were asked what they think is the most common mistake
candidates make during job interviews. Six different mistakes were provided. Which of
the following is the best for presenting the information?
a) A bar chart
b) A histogram
c) A stem-and-leaf display
d) A contingency table
Q.2 You have collected information on the market share of 5 different search engines used
by Indian Internet users in a particular quarter. Which of the following is the best for
presenting the information?
a) A pie chart
b) A histogram
c) A stem-and-leaf display
d) A contingency table
Q.3 You have collected information on the consumption by the 15 largest coffee-consuming
nations. Which of the following is the best for presenting the shares of the
consumption?
a) A pie chart
b) A Pareto chart
d) A contingency table
NOTE: Even though a pie chart can also be used, the Pareto chart is preferable for separating
the “vital few” from the “trivial many”.
Q.4 You have collected data on the approximate retail price (in $) and the energy cost per
year (in $) of 15 refrigerators. Which of the following is the best for presenting the
data?
a) A pie chart
b) A scatter plot
Q.5 You have collected data on the number of Indian households actively using online
banking and/or online bill payment over a 10-year period. Which of the following is
the best for presenting the data?
a) A pie chart
b) A stem-and-leaf display
d) A time-series plot
Q.6 You have collected data on the monthly seasonally adjusted civilian unemployment
rate for India over a 10-year period. Which of the following is the best for
presenting the data?
a) A contingency table
b) A stem-and-leaf display
c) A time-series plot
d) A side-by-side bar chart
Q.7 You have collected data on the number of complaints for 6 different brands of
automobiles sold in India over a 10-year period. Which of the following is the best
for presenting the data?
a) A contingency table
b) A stem-and-leaf display
c) A time-series plot
a) A contingency table
b) A stem-and-leaf display
c) A time-series plot
d) A Pareto chart
b) frequency polygon
c) ogive
d) bar chart
Q.10 The sum of the percent frequencies for all classes will always equal
a) One
b) The number of classes
c) The number of items in the study
d) 100
SCENARIO 1
A sample of 200 students at XYX university was taken after the mid-term to ask them
whether they went bar hopping the weekend before the mid-term or spent the weekend
studying, and whether they did well or poorly on the mid-term. The following table contains
the result.
Q.1 Referring to Scenario 1, of those who went bar hopping the weekend before the mid-term
in the sample, ______ percent of them did well on the mid-term.
a) 15
b) 27.27
c) 30
d) 55
Q.2 Referring to Scenario 1, of those who did well on the mid-term in the sample, ______
percent of them went bar hopping the weekend before the mid-term.
a) 15
b) 27.27
c) 30
d) 50
Q.3 Referring to Scenario 1, ______ percent of the students in the sample went bar
hopping the weekend before the mid-term and did well on the mid-term.
a) 15
b) 27.27
c) 30
d) 50
Q.4 Referring to Scenario 1, ______ percent of the students in the sample spent the
weekend studying and did well on the mid-term.
a) 40
b) 50
c) 72.72
d) 80
[Answer key for the Unit 2 self-assessment questions: answer letters not legible]
NUMERICAL DESCRIPTIVE MEASURES
STRUCTURE
3.0 Objectives
3.1 Introduction
3.2 Central Tendency
3.3 Variation and Shape
3.4 Exploring Numerical Data
3.5 The Covariance and the Coefficient of Correlation
3.6 Let Us Sum Up
3.7 Key Words
3.8 Self-Assessment Questions
3.9 Check Your Progress - Possible Answers
3.10 Answers to Self-Assessment Questions
3.0 OBJECTIVES
3.1 INTRODUCTION
In Unit 2 we discussed tabular and graphical presentations used to summarize data. In this
unit, we introduce numerical measures of location, dispersion, shape, and association, which
provide additional ways of summarizing data. If the measures are computed for data from a sample,
they are called sample statistics. If the measures are computed for data from a population,
they are called population parameters. In statistical inference, a sample statistic is referred to
as the point estimator of the corresponding population parameter.
UNIT 3
Numerical Descriptive Measures
3.2 CENTRAL TENDENCY
Let us consider the simple case of an MCD company where Mr. Ram Kumar is the Chief
Administrative Officer. He has provided data pertaining to employee salaries as follows:
the mean salary is INR 10,000, and the median salary of the company is INR 5,000.
The sample mean is $\bar{x} = \frac{\sum x}{n}$ and, for the population, the mean is $\mu = \frac{\sum x}{N}$.
Now, let us consider a 10-employee salary sheet (refer to the Excel sheet
unit3_workbook_worksheet1):
8,945, 10,345, ..., 7,698, 7,589, ..., 9,876, ...
Here the sample size is 10, and we have calculated the sample mean by adding all the
salaries and dividing the total by 10.
This can also be calculated by using the Excel function
=AVERAGE(D4:D13).
Alternatively, perform the following steps:
Step 1: Take the sum of all the salaries by using the function =SUM().
Step 2: Divide the sum by n = 10.
The calculation of the mean is simple. Of the measures of central tendency, the arithmetic
mean is the least affected by fluctuations of sampling. However, in extremely skewed
distributions, the arithmetic mean is not a suitable measure for statistical analysis.
WEIGHTED MEAN
The weighted mean is calculated by multiplying each value by its weight or frequency, adding
the products together, and dividing the total by the sum of the weights or frequencies.
For example, suppose the sales data for shoes in a store is as follows:
SHOES DATA
Shoe Size | Cost of each pair | No. of shoes
4 | 300 | 20
5 | 235 | 34
6 | 346 | 23
7 | 450 | 45
8 | 569 | 56
9 | 650 | 65
In the above example we calculate the mean by multiplying the weight, i.e. the number of
shoes of each size, by the respective cost per pair, adding the products, and then
dividing the sum of the products by the total number of shoes in the store.
The mean cost is Rs 478.65.
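The weighted-mean calculation for the shoes data can be reproduced in a few lines. This is a minimal Python sketch rather than the workbook's own method; the variable names are chosen here for illustration.

```python
# Cost per pair and number of pairs for shoe sizes 4 to 9 (SHOES DATA table).
costs = [300, 235, 346, 450, 569, 650]
pairs = [20, 34, 23, 45, 56, 65]

# Multiply each cost by its weight, add the products, divide by the total weight.
total_value = sum(cost * count for cost, count in zip(costs, pairs))
total_pairs = sum(pairs)
weighted_mean = total_value / total_pairs

print(total_value, total_pairs, round(weighted_mean, 2))  # 116312 243 478.65
```

The unweighted average of the six costs would be 425, so ignoring the weights understates the mean cost because the expensive sizes sell in larger numbers.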
To understand the geometric mean and the harmonic mean, you can browse the site
www.mathsisfun.com
MEDIAN
The median is also a measure of central location. The median is the middle value in the data
set when the values are arranged in order.
For example, if we have salary data for 100 employees and the median value is Rs 5,000, it
means that 50 employees in this organization have a salary of more than Rs 5,000, and 50
employees have a salary of less than Rs 5,000.
45
67
234
234
345
345
432
456
567
567
567
675
678
789
876
876
987
6789
We have a total of 18 data items in the above table. We used the Excel Sort function to
arrange them in ascending order. Dividing 18 by 2, we obtain the integer 9. So, the median is
the average of the values at the 9th and 10th positions from the top, which in this case is 567. If the
total number of elements had been 19, the mid-point would have been 9.5, and we would
have taken the value at the 10th position.
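The position rule described above can be sketched in Python as follows. The sketch applies the even/odd rule by hand and then cross-checks the result against the standard library's `statistics.median`.

```python
import statistics

# The 18 values from the table above.
values = [45, 67, 234, 234, 345, 345, 432, 456, 567,
          567, 567, 675, 678, 789, 876, 876, 987, 6789]

ordered = sorted(values)
n = len(ordered)
if n % 2 == 0:
    # Even count: average the values at positions n/2 and n/2 + 1 (1-based).
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
else:
    # Odd count: take the single middle value.
    median = ordered[n // 2]

print(median)  # 567.0
print(median == statistics.median(values))  # True
```

Note that the extreme value 6789 does not affect the median, whereas it would pull the arithmetic mean upwards.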
We also refer to the quartiles: the first quartile, or 25th percentile (Q1); the second quartile, or
50th percentile (Q2); and the third quartile, or 75th percentile (Q3).
The formula for finding the position of the p-th percentile in the ordered data is $i = \frac{p}{100} \times n$.
For the 75th percentile of the 18 values above:
$i = \frac{75}{100} \times 18 = 13.5$ (rounded up to the 14th position)
The 75th percentile, or Q3, is therefore 789.
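The position rule can be written as a small function. This is a sketch of the convention used in this unit only (round a fractional position up; average two neighbouring values when the position is a whole number, as was done for the median). The function name `percentile_by_position` is hypothetical, and spreadsheet or library percentile functions may use other interpolation rules and return slightly different values.

```python
import math

# The 18 ordered values from the median example.
ordered = [45, 67, 234, 234, 345, 345, 432, 456, 567,
           567, 567, 675, 678, 789, 876, 876, 987, 6789]


def percentile_by_position(sorted_values, p):
    """Return the p-th percentile (0 < p < 100) using the position rule."""
    n = len(sorted_values)
    i = p / 100 * n
    if i == int(i):
        # Whole-number position: average the values at positions i and i + 1.
        i = int(i)
        return (sorted_values[i - 1] + sorted_values[i]) / 2
    # Fractional position: round up to the next position.
    return sorted_values[math.ceil(i) - 1]


q1 = percentile_by_position(ordered, 25)
q2 = percentile_by_position(ordered, 50)
q3 = percentile_by_position(ordered, 75)
print(q1, q2, q3)  # 345 567.0 789
```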
In the case of a frequency distribution, the median is located with the help of the
cumulative frequencies:
Class no. | Class | Frequency | Cumulative frequency
1 | Less than 2 | 4 | 4
2 | 2 to 4 | 7 | 11
3 | 4 to 6 | 10 | 21
4 | 6 to 8 | 12 | 33
5 | 8 to 10 | 14 | 47
6 | 10 to 12 | 6 | 53
7 | 12 to 14 | 15 | 68
8 | 14 to 16 | 13 | 81
9 | 16 to 18 | 1 | 82
Total | | 82 |
The total frequency in this data set is n = 82, and n/2 is 82/2 = 41.
The cumulative frequency just greater than 41 is 47, which belongs to the class 8 to 10, so
our median class is 8 to 10.
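Once the median class is known, the median itself is usually obtained by interpolating inside that class. The sketch below assumes the standard grouped-data formula, median = l + ((n/2 − cf) / f) × h, where l is the lower limit of the median class, cf the cumulative frequency before it, f its frequency and h its width; this formula is an assumption here, since the unit's own formula is not legible. The lower limit 0 for the class "Less than 2" is also an assumption.

```python
# (lower limit, upper limit, frequency) for each class of the table above.
classes = [(0, 2, 4), (2, 4, 7), (4, 6, 10), (6, 8, 12), (8, 10, 14),
           (10, 12, 6), (12, 14, 15), (14, 16, 13), (16, 18, 1)]

n = sum(freq for _, _, freq in classes)
half = n / 2

cumulative = 0  # cumulative frequency before the current class
for lower, upper, freq in classes:
    if cumulative + freq > half:
        # First class whose cumulative frequency exceeds n/2: the median class.
        median_class = (lower, upper)
        median = lower + (half - cumulative) / freq * (upper - lower)
        break
    cumulative += freq

print(median_class, round(median, 2))  # (8, 10) 9.14
```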
The median is often the most suitable measure of central tendency, particularly when the
data set is skewed or contains extreme values.
MODE
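The third measure of central tendency is the mode. For a frequency distribution, the value of
the mode is calculated using the following formula:
$\text{Mode} = l + h \times \frac{f_0 - f_1}{2 f_0 - f_1 - f_2}$
Here l is the lower limit of the modal class, h is the class width, f0 is the frequency of the
modal class, and f1 and f2 are the frequencies of the classes before and after it.
Refer to Unit3_Workbook_worksheet3.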
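Table 3.2: Frequency Distribution of Shareholding month wise.
Class no. | Class | Frequency | Cumulative frequency
1 | Less than 2 | 4 | 4
2 | 2 to 4 | 7 | 11
3 | 4 to 6 | 10 | 21
4 | 6 to 8 | 12 | 33
5 | 8 to 10 | 14 | 47
6 | 10 to 12 | 6 | 53
7 | 12 to 14 | 15 | 68
8 | 14 to 16 | 13 | 81
9 | 16 to 18 | 1 | 82
Total | | 82 |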
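The mode of Table 3.2 can be worked out as follows. This Python sketch assumes the usual grouped-data reading of the formula, with f0 the frequency of the modal class and f1, f2 the frequencies of the neighbouring classes; the lower limit 0 for "Less than 2" is an assumption.

```python
# (lower limit, upper limit, frequency) for each class of Table 3.2.
classes = [(0, 2, 4), (2, 4, 7), (4, 6, 10), (6, 8, 12), (8, 10, 14),
           (10, 12, 6), (12, 14, 15), (14, 16, 13), (16, 18, 1)]

freqs = [freq for _, _, freq in classes]
k = freqs.index(max(freqs))          # index of the modal class
lower, upper, f0 = classes[k]
f1 = freqs[k - 1] if k > 0 else 0    # frequency of the preceding class
f2 = freqs[k + 1] if k < len(freqs) - 1 else 0  # frequency of the next class
h = upper - lower

mode = lower + h * (f0 - f1) / (2 * f0 - f1 - f2)
print((lower, upper), round(mode, 2))  # (12, 14) 13.64
```

Here the modal class is 12 to 14 (frequency 15), so the mode is 12 + 2 × 9/11.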
a) 20
b) Square root of 20
c) Square root of 96
d) 96
Variation and shape are among the major characteristics of a data set; variation gives an
idea of the spread or dispersion of the values. The range is the simplest measure of
variation. It gives the size of the basket that holds the data set.
The Range
The range, represented by the symbol R, is the difference between the largest and the
smallest value of the data set.
The range only states the spread of the data set. We are also interested in the distribution
of the data within the data set, that is, how the data are scattered around the centre point.
We can get this information only by taking the difference between each data value and the
centre point. This difference is called the variation; sometimes the variation is positive and
sometimes it is negative.
For example, suppose the central tendency of the salary in the MCD case is Rs 5,000 and
employee A's salary is Rs 3,000. Then the variation is 3,000 − 5,000 = −2,000.
To eliminate the effect of the negative signs, we use a measure of absolute variability.
Variance and standard deviation: These statistics measure the average deviation of the data
around the mean; they give a picture of how the data fluctuate above and below it.
The process of calculating the sample variance (S²) and the sample standard deviation (S) is:
Step 1: Find the difference between each value and the mean.
Step 2: Square each difference.
Step 3: Add the squared differences.
Step 4: Divide this total by n − 1 to get the sample variance.
Step 5: Take the square root of the sample variance to get the sample standard deviation.
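The steps can be traced in code. This is a minimal Python sketch on a small hypothetical data set (10, 20, 30, chosen only for illustration); the intermediate steps of adding the squared differences and dividing by n − 1 follow the standard definition of the sample variance, and the result is cross-checked against `statistics.variance`.

```python
import math
import statistics

data = [10, 20, 30]  # hypothetical sample, for illustration only

mean = sum(data) / len(data)
differences = [x - mean for x in data]            # value minus mean
squared = [d ** 2 for d in differences]           # square each difference
sum_of_squares = sum(squared)                     # add the squared differences
sample_variance = sum_of_squares / (len(data) - 1)  # divide by n - 1
sample_sd = math.sqrt(sample_variance)            # square root gives the SD

print(sample_variance, sample_sd)  # 100.0 10.0
print(sample_variance == statistics.variance(data))  # True
```

The differences themselves (−10, 0, 10) add up to zero, which is why they are squared before averaging.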
For finding the variance and standard deviation of a frequency distribution, the formula
used is
$s^2 = \frac{\sum f M^2 - n\mu^2}{n - 1}$
Here M is the midpoint of the class, μ is the mean of the data set, and n is the total frequency.
The square root of the variance is the standard deviation. For a population of N values,
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
The objective of converting the variance into the standard deviation is to measure the deviation
of the data from the central tendency in the same unit as the data.
We can explore numerical data by calculating the central tendency, variation, and shape.
We can also visualize the distribution of numerical variables by calculating the quartiles and
the five-number summary and by constructing a boxplot.
In the previous section, we learnt about central tendency and the method of calculating the
quartiles.
Here we will first try to understand what a percentile is and what the interquartile range is,
and then we will look at the five-number summary.
Percentile: Percentiles are related to the quartiles. They split the ordered data into 100 equal
parts. By this definition, the first quartile is equivalent to the 25th percentile, the second
quartile to the 50th percentile, and the third quartile to the 75th percentile.
The Interquartile Range
The difference between the third quartile and the first quartile is called the interquartile
range.
Interquartile range =Q3-Q1
The Five-number Summary
The five-number summary of a data set consists of the smallest value, the first quartile, the
median, the third quartile, and the largest value.
The distance from $X_{\text{smallest}}$ to the median (40 − 28 = 12) is slightly more than the distance
from $X_{\text{largest}}$ to the median (51 − 40 = 11). Therefore, the distribution is slightly left-skewed.
The Boxplot
To visualize the shape of the distribution given by the five-number summary, a boxplot is used.
It is drawn along a horizontal axis: a box spans the first to the third quartile, a vertical line
inside the box marks the median, and lines extend from the box to the smallest and largest values.
b) The coefficient of correlation
The Covariance
One can understand the strength of the linear relationship between two numerical variables (X
and Y) by measuring the covariance.
The formula for measuring the sample covariance is
$\operatorname{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
The covariance does not expressly state the relative strength of the relationship; it is
essentially the total product of the individual variations. From its value alone we cannot
tell whether the relationship is strong or weak, and so we use another measure, the
coefficient of correlation.
The coefficient of correlation measures the relative strength of the linear relationship
between two numerical variables. The sample coefficient of correlation is represented by the
symbol r, which ranges from −1 for a perfectly negative correlation to +1 for a perfectly
positive correlation. When dealing with a population, the coefficient of correlation is
represented by the symbol ρ (rho).
The sample coefficient of correlation (r) is calculated as:
$r = \frac{\operatorname{cov}(X, Y)}{S_X S_Y}$
where
$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
$S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}$
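The covariance and the coefficient of correlation can be computed step by step. The sketch below reuses the rainfall and paddy production figures from Ex-3 of Unit 2 as sample data (the choice of data set is made here for illustration) and uses `statistics.stdev`, which divides by n − 1, for the sample standard deviations.

```python
import statistics

x = [123, 134, 156, 167, 134]   # yearly rainfall (Unit 2, Ex-3)
y = [678, 654, 698, 643, 690]   # annual paddy production (Unit 2, Ex-3)

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: sum of products of deviations, divided by n - 1.
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations of X and Y.
s_x = statistics.stdev(x)
s_y = statistics.stdev(y)

r = cov_xy / (s_x * s_y)
print(round(cov_xy, 2), round(r, 2))  # -119.35 -0.28
```

The covariance of −119.35 says little on its own because it depends on the units of both variables; dividing by the two standard deviations gives r of about −0.28, a weak negative linear relationship.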
The manager of a factory wants to predict how many extra conference kits to prepare on a
day when a conference is held at the IMTCDL. A random sample of records of the last few
years is as follows:
Population Correlation
$\rho = \frac{\sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)}{\sqrt{\sum_{i=1}^{N} (X_i - \mu_X)^2 \sum_{i=1}^{N} (Y_i - \mu_Y)^2}}$
3.6 LET US SUM UP
In this unit, we introduced several basic elements of descriptive statistics, helpful in solving
business problems. We have also discussed the utilities of the central tendency, variance,
and correlations.
Q.4 The difference between the largest and smallest observations in an ordered data set is
called the range.
Q.5 The standard deviation is expressed in terms of the original units of measurement, but
the variance is not.
Q.6 The data set 10, 20, 30 has the same variance as the data set 100, 200, 300.
a. The mean
b. The median
c. The mode
d. All these choices are true.
Q.2 Which measure(s) of central location is/are meaningful when the data are ordinal?
a. The mean and median
b. The mean and mode
c. The median and mode
d. Only mean
Q.3 Which of the following statements about the mean is not always correct?
a. The sum of the deviations from the mean is zero.
d. The value of the mean times the number of observations equals the sum of all
observations.
Q.4 Which of the following statements is true for the following observations: 9, 8, 7, 9, 6,
11 and 13?
a. The mean, median, and mode are all equal.
b. The median
c. The mode
a. The median
b. The interquartile range
c. The mean
d. The first quartile
When the distribution is positively skewed, mean < median < mode.
Q1 c
3.10 ANSWERS TO SELF-ASSESSMENT QUESTIONS
Q1 Reference
Q.2 Reference
Q3 False
Q.4 True
Q5 True
Q.6 False
Q1 d
[Answers to Q.2 to Q.10: answer letters not legible]
Case Study:
Ages of Senior Citizens
A sociologist recently conducted a survey of citizens over 65 years of age whose net worth is
too high to qualify for Medicaid, and who have no private health insurance. The ages of 22
uninsured senior citizens were as follows: 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79,
80, 81, 86, 87, 91, 92, 94, and 97.
Q.1 Calculate the mean age of the uninsured senior citizens.
Ans. $\bar{x}$ = 78.0 years
Q.2 Calculate the median age of the uninsured senior citizens.
BASIC PROBABILITY
STRUCTURE
4.0 Objectives
4.1 Introduction
4.2 Basic Probability Concepts
4.3 Conditional Probability
4.4 Ethical Issues and Probability
4.5 Bayes' Theorem
4.6 Let Us Sum Up
4.7 Key Words
4.8 References and Suggested Additional Readings
4.9 Self-Assessment Questions
4.10 Answers to Self-Assessment Questions
4.11 Check Your Progress-Possible Answers
4.0 OBJECTIVES
After reading this unit, you will be able to:
4.1 INTRODUCTION
In Units 1, 2 and 3, we learned various ways of defining, organizing, visualizing, and
analyzing data to provide a sensible picture for business decision-making. In this unit, we
will look at business practices in which managers talk about the uncertainty or likelihood of
sales of a particular product or service in their geographical locations, expressed either as a
percentage or as a probability (out of 1), to assess the market potential.
For example, what are the chances of the sale of FMCG goods in the U.P. market? What are
the chances for a student to clear the exam? What is the chance of getting a head in a toss of a
coin? What is the chance of zero internet drops per day? When a chance is expressed as a
fraction of 1, we call it a probability. The principles of probability help bridge the worlds of
descriptive and inferential statistics. Probability means the likelihood that an event will occur.
Thus, the concept of probability is used to measure the degree of uncertainty in a business.
Probability values are always assigned on a scale of 0 to 1. A probability close to 0
indicates that the event is unlikely to happen. A probability close to 1 indicates a near-
certainty that the event will occur.
The probability is represented by a symbol P, for example, the probability of passing the exam
can be written as:
E = Passing the exam.
P(E) = 0.65
There are three types of probability:
• A priori
• Empirical
• Subjective
For example, suppose I ask you to measure the probability that a day of the year 2021 falls in
January. Here you have the prior knowledge that January has 31 days and that the year 2021
has 365 days.
P(January) = 31/365
Out of the 365 days of the year, 31 are in January. Therefore, the share of January in the year
is 31/365 or, in other words, the probability of a January day in 2021 is 0.0849. A probability P
is always between 0 and 1. In this example, we used prior knowledge to measure the
probability; therefore, this type of probability is called an a priori probability.
UNIT 4
Basic Probability
The probability that we calculate from the available data set of the business is called an
empirical probability. The formula for measuring probability is:
PROBABILITY OF OCCURRENCE
An event is a set of sample points. For example, the days of January in a year are an event.
Probability of an event
The probability of any event is equal to the sum of the probabilities of the sample points in
the event.
Some basic relationships of Probability
The complement of the event “Days of January” is defined as the event consisting of all the
sample points that are not in January. The complement of January is denoted by
(Days of January)^c, i.e. the days not in January.
We can understand the complement of an event with the help of Fig. 4.1.
[Fig. 4.1: Event A and the complement of event A]
Addition Law:
Union of two events
The union of two events A and B is the set of all the sample points that belong to A or to B
or to both.
[Figure: The area shaded black and grey is A ∪ B]
Let us try to understand with the help of an example: the probability that a day of the year
is a day of January or a Wednesday.
Thus, it is P(Jan ∪ Wednesdays) = P(Jan) + P(Wednesdays) − P(Jan ∩ Wednesdays).
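The addition law can be checked by counting days directly. The sketch below is an illustration only: it assumes a 365-day year and uses 2021 for the calendar (the choice of year is an assumption made here), and it relies on Python's standard `datetime` module, in which `weekday()` returns 2 for Wednesday.

```python
from datetime import date, timedelta

year = 2021  # assumed non-leap year, chosen for illustration
days = [date(year, 1, 1) + timedelta(days=k) for k in range(365)]

january = {d for d in days if d.month == 1}
wednesdays = {d for d in days if d.weekday() == 2}

total = len(days)
p_jan = len(january) / total
p_wed = len(wednesdays) / total
p_both = len(january & wednesdays) / total   # January days that are Wednesdays

# Addition law: subtract the overlap so it is not counted twice.
p_union = p_jan + p_wed - p_both

print(len(january), len(wednesdays), len(january & wednesdays))  # 31 52 4
print(round(p_union, 4))  # 0.2164
```

Counting the union directly (79 distinct days out of 365) gives the same answer, which is what the subtraction of P(Jan ∩ Wednesdays) guarantees.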
Among the days of January and February in the year 2021, there is no common day between
January and February.
January and February are therefore called mutually exclusive events, and
P(Jan ∪ Feb) = P(Jan) + P(Feb)
= 31/365 + 28/365
= (31 + 28)/365
= 59/365
= 0.1616
$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$
In the above formula, the left-hand side is the conditional probability, and we read it as
the probability of A given that B has occurred.
= 200/300
= 0.67
0.67 is the conditional probability.
Let us try to understand with the help of the following case study.
Consider the promotion status of male and female officers in the Education Department in
UP. The Education Department consists of 1,000 officers, 800 men and 200 women. Over the
past two years, 300 education officers obtained promotion; the breakdown for male and
female officers is shown in Table 4.1.
Table 4.1
 | Men | Women | Total
Promoted | 200 | 100 | 300
Not promoted | 600 | 100 | 700
Total | 800 | 200 | 1,000
After reviewing the promotion record, female officers raised a discrimination issue on the
grounds that 200 male officers had received promotions, but only 100 female officers had.
The Education Department explained that the relatively low number of promotions for
female officers was not due to discrimination, but to the fact that relatively few women are
members of the Education Department.
Let us see how conditional probability could be used to analyze the discrimination charge.
Let
EM = event that an officer is a man
EW = event that an officer is a woman
A = event that an officer is promoted
Let us calculate the probability that an officer is promoted given that the officer is a man,
in other words, the probability of promotion within the men's category.
$P(A \mid EM) = \frac{200}{800} = 0.25$
Similarly, the probability of promotion within the women's category is:
$P(A \mid EW) = \frac{100}{200} = 0.50$
What conclusion do we draw?
The probability of promotion for a woman is double the probability of promotion for a man.
The conditional probability calculation rejects the argument presented by the female officers.
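The two conditional probabilities can be reproduced from the counts given in the case study. This is a minimal Python sketch; the dictionary layout and the helper name `conditional` are chosen here for illustration.

```python
total_officers = 1000
officers = {"men": 800, "women": 200}   # department composition
promoted = {"men": 200, "women": 100}   # promotions over the past two years


def conditional(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B)."""
    return p_a_and_b / p_b


p_promoted_given = {}
for group in officers:
    p_joint = promoted[group] / total_officers   # P(A and group)
    p_group = officers[group] / total_officers   # P(group)
    p_promoted_given[group] = conditional(p_joint, p_group)

print({g: round(p, 2) for g, p in p_promoted_given.items()})
# {'men': 0.25, 'women': 0.5}
```

Comparing raw counts (200 against 100) ignores the size of each group; conditioning on the group corrects for that.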
Ethical issues related to probability arise when information is expressed in terms of
probability or chances in a way that confuses the public, who are then unable to trust it. This
type of situation generally arises when the probability information appears in advertisements.
To ensure that probability information is ethically sound, we must cite the pertinent
data set or the source of knowledge and, in the process, analyze the facts.
Bayes' theorem is used to obtain revised, or posterior, probabilities from existing
probabilities, i.e. to reverse the conditioning.
We start the analysis with initial, or prior, probability estimates for specific events of interest.
Then, from sources such as a sample or a special report, we obtain additional information
about the events. Using the new information, we update the prior probability values by
calculating revised probabilities, referred to as posterior probabilities. Bayes' theorem
provides a method for these probability calculations.
The flow of the probability revision process is shown in Fig. 4.3.
$P(X_2 \mid Y) = \frac{P(X_2)\,P(Y \mid X_2)}{P(X_1)\,P(Y \mid X_1) + P(X_2)\,P(Y \mid X_2)}$
Please note that we have derived the probabilities of X1 and X2, given that Y has occurred, from
the probabilities of Y given that X1 and X2 have occurred, respectively.
A manufacturing company receives raw materials from two vendors, Ram and Mohan. 70% of
the parts are ordered from Ram and 30% from Mohan. Further, 2% of the parts coming from
Ram are defective, and 5% of those from Mohan are defective.
The machine breaks down because of processing a bad part. What is the probability that the
part came from Ram?
Prior Probability
P(A1) = .70 and P(A2) = .30
i.e. what are the posterior probabilities P(A1|B) = ? and P(A2|B) = ?
We use the Tree Approach and the Tabular Approach to solve the problem.
In the above case, we have the historical quality levels of the two suppliers:
Supplier | % good parts | % bad parts
Ram (A1) | 98 | 2
Mohan (A2) | 95 | 5
Let G denote a good part and B a bad part. Then, for supplier 1,
P(G|A1) = .98 and P(B|A1) = .02
Supplier | Prior probability | P(B|Ai) | Joint probability
Ram | .70 | .02 | .014
Mohan | .30 | .05 | .015
Total | | | .029
P(A1|B) = .014/.029 = .48
Therefore, the probability that the defective part came from Ram is .48, i.e. 48%
(refer to the earlier equation).
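Bayes' theorem for the two-vendor case can be traced in a few lines. This is a minimal Python sketch using the 70%/30% supply shares and the 2%/5% defect rates stated in the case; the dictionary names are chosen here for illustration.

```python
# Prior probabilities: share of parts supplied by each vendor.
prior = {"Ram": 0.70, "Mohan": 0.30}
# Conditional probabilities of a bad part given the vendor.
p_bad_given = {"Ram": 0.02, "Mohan": 0.05}

# Joint probability of (vendor and bad part) for each vendor.
joint = {v: prior[v] * p_bad_given[v] for v in prior}
# Total probability of a bad part (denominator of Bayes' theorem).
p_bad = sum(joint.values())
# Posterior probability of each vendor given that the part is bad.
posterior = {v: joint[v] / p_bad for v in joint}

print(round(p_bad, 3))                                # 0.029
print({v: round(p, 2) for v, p in posterior.items()}) # {'Ram': 0.48, 'Mohan': 0.52}
```

Although Ram supplies 70% of the parts, a bad part is slightly more likely to have come from Mohan, because Mohan's defect rate is two and a half times Ram's.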
Tabular Approach:
Let us consider a case study.
In a piston factory, machines A, B, and C manufacture, respectively, 30%, 36% and 34% of the
total production. Of their output, 10%, 5%, and 8% respectively are defective bolts.
One bolt is drawn at random from the lot and is found to be defective. What is the probability
that it was manufactured by machine C?
Solution:
The probability that the randomly picked defective bolt was manufactured by machine C is .3617 (36.17%).
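The tabular approach lists, for each machine, the prior probability, the conditional probability of a defect, their product (the joint probability) and the posterior probability. The sketch below builds that table in Python; it assumes, as the answer .3617 implies, that the bolt drawn is known to be defective. The row layout is chosen here for illustration.

```python
# (machine, share of production, defect rate) from the case study.
rows = [("A", 0.30, 0.10), ("B", 0.36, 0.05), ("C", 0.34, 0.08)]

# Joint probability of (machine and defective) for each machine.
joint = {m: share * rate for m, share, rate in rows}
p_defective = sum(joint.values())
# Posterior probability of each machine given a defective bolt.
posterior = {m: joint[m] / p_defective for m in joint}

print("Machine  Prior  P(D|M)  Joint   Posterior")
for m, share, rate in rows:
    print(f"{m:7}  {share:.2f}   {rate:.2f}    {joint[m]:.4f}  {posterior[m]:.4f}")
print(f"Total                   {p_defective:.4f}  1.0000")
```

The posterior column shows .3989 for machine A, .2394 for machine B and .3617 for machine C, so a defective bolt is most likely to have come from machine A.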
4.6 LET US SUM UP
In this unit, we introduced the concept of probability and learned from examples how
probability analysis can be used for decision-making in business. We learned that a probability
is a numeric value between 0 and 1 that represents the likelihood or the possibility that a
particular event will occur. We also learned about more complex aspects of probability, such as
joint, conditional, and posterior probabilities.
4.7 KEYWORDS
Conditional Probability: The probability of an event given that another event has already
occurred. The conditional probability of A given B is P(A|B) = P(A ∩ B)/P(B).
Joint Probability: The probability of two events occurring simultaneously, that is the
intersection of two events.
Marginal Probability: The values in the margins of a joint probability table that provide the
probabilities of each event separately.
phone. If the relative frequency approach for assigning probabilities is used, the
probability that the next customer will purchase a wireless phone is:
a. 0.10
b. 0.90
c. 0.50
d. None of these choices.
Q.2 If A and B are mutually exclusive events with P(A) = 0.75, then P(B):
a. can be any value between 0 and 1
b. can be any value between 0 and 0.75
c. cannot be larger than 0.25
d. equals 0.25
Q.3 If you roll a balanced die 50 times, you expect an even number to appear:
a. on every other roll
b. exactly 50 times out of 100 rolls
Q.4 An approach of assigning probabilities which assumes that all outcomes of the
experiment are equally likely is referred to as the:
a. subjective approach
b. objective approach
c. classical approach
a. a simple event
b. a sample space
c. a sample
d. a population
Q.6 If two events are mutually exclusive, what is the probability that one or the other
occurs?
a. 0.00
b. 0.50
c. 1.00
d. Cannot be determined from the information given
Q.7 If two events are mutually exclusive, what is the probability that both occur at the
same time?
a. 0.00
b. 0.50
c. 1.00
d. Cannotbe determined from the information given
Q.8 If two events are mutually exclusive and collectively exhaustive, what is the probability
that both occur?
a. 0.00
b. 0.50
c. 1.00
d. Cannot be determined from the information given
Q.9 If the two events are mutually exclusive and collectively exhaustive, what is the
probability that one or the other occurs?
a. 0.00
b. 0.50
c. 1.00
b. 0.50
c. 1.00
d. Cannot be determined from the information given
What is the probability that a subscriber rented a car during the past 12 months
for business or personal reasons?
What is the probability that a subscriber did not rent a car during the past 12
months for either business or personal reasons?
Q.2 Assume that we have two events, A and B, that are mutually exclusive. Assume further
that we know P(A)=.30 and P(B)=.40.
a. What is P(A ∪ B)?
b. What is P(A ∩ B)?
c. What general conclusion would you make about mutually exclusive and
independent events given the results of this problem?
Q1 a
Q.2 c
[Answers to the remaining questions: answer letters not legible]
STRUCTURE
5.0 Objectives
5.1 Introduction
5.2 Definitions
5.3 Probability Distributions
5.4 The Importance of Expected Value in Decision-Making
5.5 Binomial Probability Distribution
5.6 Poisson Distribution
5.7 Let Us Sum Up
5.8 Key Words
5.9 Case
5.10 Self-Assessment Questions
5.0 OBJECTIVES
After reading this unit, you will be able to:
5.1 INTRODUCTION
In this unit, you would be familiarized with the concepts and characteristics of probability
distributions. We shall learn about the special cases of probability distributions i.e. Binomial
distribution and Poisson distribution, their assumptions and applications using problems.
UNIT 5
Discrete Probability Distribution
Let us introduce some basic terms that we will be using in this unit.
We have already studied different types of variables in Unit 1. Let us recapitulate the concept
of a variable. A variable is a characteristic, number, or quantity that we are interested in
exploring, e.g. age, country of birth, marital status, etc.
In this unit, we will be using the term random variable. What is the difference between a
variable and a random variable? In statistics, randomness has a pattern. Consider rolling an
unbiased die: we are not aware of the outcome of the experiment until the die is rolled, but
we know that the outcome will be one of the numbers from 1 to 6 only. So the outcome of
rolling the die is a random variable and not just a variable, i.e. when the value of a variable
is the outcome of a statistical experiment, that variable is known as a random variable.
For more clarity, let us now define a random variable.
A random variable is one whose range of outcomes is known in advance, but whose
actual outcome appears only after the experiment is performed. A random variable is
represented by the letter x, and it assigns a numerical value to every outcome in the sample
space. Just as variables are divided into two types, continuous and discrete, random variables
are categorized as discrete random variables and continuous random variables.
Difference between Discrete and Continuous Random Variable
When our interest is in counting the outcomes of the experiment, we use discrete
random variables, e.g. the number of customers expected at a petrol pump, the number of
children expected in a family, the number of patients in an OPD, etc.
When our interest is in measuring the outcome of the experiment, which is not limited
to discrete or integer values but can take any value in a range, we use a continuous
random variable, e.g. the time taken to reach the airport, the percentage of impurity in a
batch of chemicals, the annual income of a player, etc.
e) x = no. of girls in class
f) x = temperature measured on different days in a month
g) x = no. of computers sold in a day
h) x = total daily sales of Amazon in India
[Figure: probability plotted against values of X, 0 to 10]
[Figure: probability plotted against values of X, 0 to 7]
3 | The number of claims received by LIC in a particular year | x = {0, 1, 2, ..., n}, where n is the no. of policy holders
To make this clearer, let us expand example 6 and estimate the probability distribution.
Three unbiased coins are tossed simultaneously. So the sample space is S = {TTT, HTT, THT,
TTH, HHT, HTH, THH, HHH}.
Let X represent the no. of heads; the probability distribution is represented in Table 5.2.
Outcomes | X | P(X)
TTT | 0 | 0.125
HTT, THT, TTH | 1 | 0.375
HHT, HTH, THH | 2 | 0.375
HHH | 3 | 0.125
[Figure: probability distribution for tossing 3 coins simultaneously; probability (0 to 0.4) against no. of heads (0 to 3)]
$V(X) = \sum_{i=1}^{n} x_i^2 P(x_i) - 2\mu \times \mu + \mu^2$
$V(X) = \sum_{i=1}^{n} x_i^2 P(x_i) - \mu^2$
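The probability distribution of the number of heads, together with its mean and variance, can be obtained by listing the sample space. The sketch below is an illustration in Python using the standard `itertools` and `collections` modules; it applies the formula V(X) = Σ x² P(x) − μ².

```python
from collections import Counter
from itertools import product

# All 8 equally likely outcomes of tossing three unbiased coins.
outcomes = list(product("HT", repeat=3))

# Count how many outcomes give 0, 1, 2 and 3 heads.
heads_count = Counter(outcome.count("H") for outcome in outcomes)
distribution = {x: heads_count[x] / len(outcomes) for x in sorted(heads_count)}

mean = sum(x * p for x, p in distribution.items())
variance = sum(x ** 2 * p for x, p in distribution.items()) - mean ** 2

print(distribution)    # {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
print(mean, variance)  # 1.5 0.75
```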
A random experiment is conducted in which we toss an unbiased coin 100 times and observe
a tail 49 times and a head 51 times. This is represented as
X | Freq
H | 51
T | 49
n = | 100
A frequency distribution is a listing of the values that occurred when the experiment was
conducted, so it provides information about the variable based on actual data. A probability
distribution is a listing of the probabilities of all possible outcomes when the experiment is
conducted in the future, so it helps us in finding the possible behaviours of a variable.
Table 5.4: Probability distributions of tossing a coin
X | P(x)
H | 0.5
T | 0.5
Here it is assumed that a head and a tail are equally likely when a coin is flipped.
Example 5.1: Suppose that Mr. Sharma wants to start a business of selling ice cream. Since he
has limited funds, he can initially make 200 cups. So he decides to make 100 cups of both
flavours - vanilla and butterscotch. After a week he wants to know whether the demand for
both the flavours of ice cream is similar or one particular flavour is more in demand. Is it
beneficial to switch to only one flavour? The data he collected are shown in the Table 5.5
below:
Table 5.5: Demand of Flavours of Ice-cream
Day | Vanilla | Butterscotch
1 | 70 | 60
2 | 80 | 70
3 | 70 | 70
4 | 100 | 80
5 | 70 | 50
6 | 100 | 70
7 | 90 | 60
Solution:
x | Frequency | P(x) | x × P(x) | x² × P(x)
70 | 3 | 0.285714 | 20 | 1400
80 | 1 | 0.285714 | 22.86 | 1828.57
90 | 1 | 0.142857 | 12.86 | 1157.14
100 | 2 | 0.285714 | 28.57 | 2857.14
Total | 7 | 1 | 84.29 | 7242.86
Mean =E(x)=84.29 i.e. the average demand for vanilla flavour ice cream was 84.29 cups
Variance= V(x)=138.78
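y | Frequency | P(y) | y × P(y) | y² × P(y)
60 | 2 | 0.28571 | 17.14 | 1028.57
70 | 3 | 0.42857 | 30 | 2100
80 | 1 | 0.14286 | 11.43 | 914.29
90 | 1 | 0.14286 | 12.86 | 1157.14
Total | 7 | 1 | 71.43 | 5200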
Mean =E(y)=71.43 i.e. the average demand for butterscotch flavour ice cream was 71.43
cups.
Variance= V(y)=97.95
Standard deviation=SD(y)=9.89
The demands for both the flavours of ice cream are not similar, vanilla flavour is more in
demand. It is not beneficial to switch to only one flavour i.e. vanilla as on an average the
demand for butterscotch flavour is also high.
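The mean and variance in the solution can be reproduced directly from a probability column. The sketch below assumes the probabilities printed in the vanilla solution table (0.285714, 0.285714, 0.142857, 0.285714, i.e. 2/7, 2/7, 1/7, 2/7); the helper name `mean_and_variance` is ours.

```python
from math import sqrt


def mean_and_variance(dist):
    """dist maps each value x to P(x); returns (E(X), V(X))."""
    mean = sum(x * p for x, p in dist.items())
    second_moment = sum(x * x * p for x, p in dist.items())
    return mean, second_moment - mean ** 2   # V(X) = sum of x^2 P(x) - mean^2


# Probabilities as printed in the vanilla solution table
vanilla = {70: 2 / 7, 80: 2 / 7, 90: 1 / 7, 100: 2 / 7}

mean, var = mean_and_variance(vanilla)
print(round(mean, 2), round(var, 2), round(sqrt(var), 2))
```

The same function applied to the butterscotch probabilities gives the second set of results in the solution.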
Properties of Variance of Random Variables
a) The variance of any constant is zero, i.e. V(a) = 0, where a is any constant.
b) If X is a random variable, and a and b are any constants, then V(aX + b) = a²V(X).
c) If X and Y are independent random variables, then V(X + Y) = V(X) + V(Y).
d) If X and Y are independent random variables, then V(X − Y) = V(X) + V(Y).
e) For any pair-wise independent random variables X₁, X₂, ..., Xₙ and for any constants a₁,
a₂, ..., aₙ: V(a₁X₁ + a₂X₂ + ... + aₙXₙ) = a₁²V(X₁) + a₂²V(X₂) + ... + aₙ²V(Xₙ).
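The laws of variance can be verified numerically on small distributions. The sketch below uses two hypothetical independent variables X and Y (their values and probabilities are invented for illustration) and builds the distributions of aX + b, X + Y and X − Y explicitly. For independent variables the variances add for both the sum and the difference.

```python
from itertools import product


def expectation(dist, f=lambda v: v):
    return sum(f(v) * p for v, p in dist.items())


def variance(dist):
    m = expectation(dist)
    return expectation(dist, lambda v: (v - m) ** 2)


def transform(dist, f):
    """Distribution of f(X)."""
    out = {}
    for v, p in dist.items():
        out[f(v)] = out.get(f(v), 0.0) + p
    return out


def combine(dx, dy, op):
    """Distribution of op(X, Y) when X and Y are independent."""
    out = {}
    for (x, px), (y, py) in product(dx.items(), dy.items()):
        out[op(x, y)] = out.get(op(x, y), 0.0) + px * py
    return out


X = {1: 0.2, 2: 0.5, 3: 0.3}   # hypothetical distribution
Y = {0: 0.6, 4: 0.4}           # hypothetical distribution, independent of X
a, b = 3, -2

print(variance(transform(X, lambda v: a * v + b)), a ** 2 * variance(X))
print(variance(combine(X, Y, lambda x, y: x + y)), variance(X) + variance(Y))
print(variance(combine(X, Y, lambda x, y: x - y)), variance(X) + variance(Y))
```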
Example 5.2: The marginal probability distributions of the rate of return for two stocks are
shown below:
AXIS (x) | 0 | 1 | 2
p(x) | 0.5 | 0.3 | 0.2
HDFC (y) | 0 | 1 | 2
p(y) | 0.4 | 0.5 | 0.1
f) Calculate V(X + Y) directly by using the probability distribution of X + Y.
g) Verify that V(X + Y) = V(X) + V(Y). Did you expect this result? Why?
h) Find the probability distribution of the random variable XY.
Solution
a) E(X) = 0 × 0.5 + 1 × 0.3 + 2 × 0.2 = 0.70
V(X) = (0² × 0.5 + 1² × 0.3 + 2² × 0.2) − (0.7)² = 0.61
b) E(Y) = 0 × 0.4 + 1 × 0.5 + 2 × 0.1 = 0.70
$$V(Y) = \sigma^2 = \sum_{i=1}^{n} y_i^2 P(y_i) - (E(Y))^2$$
y \ x | 0 | 1 | 2
0 | .20 | .12 | .08
1 | .25 | .15 | .10
Let us now learn how to estimate the joint probability distribution table from the marginal
tables. Since X and Y are independent, each joint probability is the product of the two
marginal probabilities.
1st row: (0, 0) i.e. X=0 & Y=0, (1, 0) i.e. X=1 & Y=0, (2, 0) i.e. X=2 & Y=0.
When x=0 and y=0, probability = 0.5 × 0.4 = 0.20; when x=1 and y=0, probability = 0.3 × 0.4 =
0.12; when x=2 and y=0, probability = 0.2 × 0.4 = 0.08, and so on.
There are two options for X+Y=1, i.e. X=0, Y=1 or X=1, Y=0: probability = 0.5 × 0.5 + 0.3 ×
0.4 = 0.37.
There are three options for X+Y=2, i.e. X=0, Y=2 or X=1, Y=1 or X=2, Y=0: probability =
0.5 × 0.1 + 0.3 × 0.5 + 0.2 × 0.4 = 0.28, and so on.
f) V(X + Y) = 0² × 0.20 + 1² × 0.37 + 2² × 0.28 + 3² × 0.13 + 4² × 0.02 − (1.40)²
V(X + Y) = 1.02
g) V(X) + V(Y) = 0.61 + 0.41 = 1.02 = V(X + Y). Yes, since X and Y are independent random
variables.
xy | 0 | 1 | 2 | 4
For XY = 0, there are 5 possible options: (0,0), (0,1), (0,2), (1,0), (2,0).
h) E(X)E(Y) = (0.7)(0.7) = 0.49 = E(XY). Yes, since X and Y are independent random
variables.
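The steps of this solution (joint table from the marginals, then the distributions of X + Y and XY) can be sketched in a few lines. The sketch assumes, as the solution does, that X and Y are independent; the function names are ours.

```python
from itertools import product


def joint_from_marginals(px, py):
    """Joint table p(x, y) = p(x) * p(y), valid only under independence."""
    return {(x, y): p * q for (x, p), (y, q) in product(px.items(), py.items())}


def distribution_of(joint, g):
    """Distribution of g(X, Y), obtained by adding the joint probabilities."""
    dist = {}
    for (x, y), p in joint.items():
        dist[g(x, y)] = dist.get(g(x, y), 0.0) + p
    return dict(sorted(dist.items()))


def mean(dist):
    return sum(v * p for v, p in dist.items())


def variance(dist):
    return sum(v * v * p for v, p in dist.items()) - mean(dist) ** 2


p_x = {0: 0.5, 1: 0.3, 2: 0.2}   # AXIS
p_y = {0: 0.4, 1: 0.5, 2: 0.1}   # HDFC

joint = joint_from_marginals(p_x, p_y)
sum_dist = distribution_of(joint, lambda x, y: x + y)
prod_dist = distribution_of(joint, lambda x, y: x * y)

print(sum_dist, round(variance(sum_dist), 2))
print(prod_dist, round(mean(prod_dist), 2))
```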
5.4 The probability distribution of a discrete random variable X is shown below, where X
represents the number of scooters owned by a family.
x | 0 1 2 3
iv. P(0<X<1)
v. P(1<X<3)
ii. E(2X²+5)
iii. E((X−2)²)
d) Apply the laws of variance to find the following:
i. V(3X)
ii. V(3X−2)
iii. V(3)
iv. V(3X)−2
Solution:
a) i. 0.35, ii. 0.85, iii. 0.60, iv. 0.00, v. 0.60
b) E(X) = 1.25 scooters, σ = 0.9937 scooters
c) i. 2.55, ii. 10.1, iii. 1.55
5.5 Determine which of the following are not valid probability distributions, and explain
why or why not?
a.
x | 0 1 2 3
p(x) | 0.15 0.25 0.35 0.45
b.
x | 2 3 4 5
p(x) | -0.10 0.40 0.50 0.25
c.
x | -2 -1 0 1 2
p(x) | -0.10 0.20 0.40 0.20 0.10
Solution:
a. This is not a valid probability distribution because the probabilities don't sum to one.
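The two requirements of a valid probability distribution (every probability lies between 0 and 1, and the probabilities sum to one) can be checked mechanically. A minimal sketch follows; the function name `check_distribution` and the tolerance are our choices.

```python
def check_distribution(probs, tol=1e-9):
    """Return a list of reasons why probs is not a valid distribution.

    An empty list means both requirements are satisfied.
    """
    problems = []
    if any(p < 0 or p > 1 for p in probs):
        problems.append("a probability lies outside [0, 1]")
    if abs(sum(probs) - 1.0) > tol:
        problems.append("the probabilities do not sum to one")
    return problems


print(check_distribution([0.15, 0.25, 0.35, 0.45]))    # distribution a
print(check_distribution([-0.10, 0.40, 0.50, 0.25]))   # distribution b
print(check_distribution([0.5, 0.5]))                  # a fair coin, for comparison
```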
X p(x)
-10 | 0.05
5 0.15
0)
4 0.1
8 0.25
12 0.3
5.9 The joint probability distribution of variables X and Y are shown in the Table below,
where X is the number of umbrellas and Y is the number of raincoats sold daily in a
small store.
X
Y 1 2 3
viii. Verify that V(X + Y) = V(X) + V(Y). Did you expect this result? Why?
Solution
I. 2.55
x 1 2 3
y 1 2 3
p(y) 0.6 0.3 0.1
Yes, because p(x, y) = p(x) × p(y) for all pairs (x, y).
E(X) = 1.7 and E(Y) =1.5
5.10 Let the random variable X represent the number of girls in a family with three
children.
Solution
5.4 THE IMPORTANCE OF EXPECTED VALUE IN DECISION-MAKING
As discussed in the previous section, the expected value provides information about the
center of the probability distribution. Apart from giving the average (center), expected
value combined with monetary benefits is a very useful concept in economics, finance, etc.
To elaborate, let us look at two different types of applications of expected value.
Example 5.3: You are interested in investing your money in the stock market. Looking at the
record you have developed the following probability distribution of the rate of return for the
stock.
Good 5% 0.45
Bad -1% 0.2
Solution
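x | p(x) | x × P(x) | x² × P(x)
10 | 0.35 | 3.5 | 35
5 | 0.45 | 2.25 | 11.25
-1 | 0.2 | -0.2 | 0.2
Total | 1 | 5.55 | 46.45
$$E(X) = \sum_{i=1}^{n} x_i P(x_i) = 5.55$$
$$V(X) = \sum_{i=1}^{n} x_i^2 P(x_i) - (E(X))^2 = 46.45 - (5.55)^2 = 15.65$$
$$SD(X) = \sqrt{V(X)} = 3.96$$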
The expected rate of return is 5.55% with a standard deviation of 3.96%.
As the expected gain is positive, you should invest in stocks.
Example 5.4: A newspaper vendor has a stall at New Delhi railway station and wants to know
the number of copies he should procure to satisfy the daily demand. He procures the
newspaper at Rs 3 and sells it at Rs 5. Any unsold newspaper is a loss for the vendor.
Based on his experience, the vendor has estimated the following probability distribution for
the number of copies demanded. How many copies should he procure to maximize his
profit?
50 0.07
60 0.11
70 0.33
80 0.26
90 0.19
100 0.04
Solution:
Profit/copy = selling price − procurement price = Rs 5 − Rs 3 = Rs 2
Demand | 50 | 60 | 70 | 80 | 90 | 100 | Expected profit
Probability | 0.07 | 0.11 | 0.33 | 0.26 | 0.19 | 0.04 |
No. of copies procured
If the no. of copies procured is 60 and demand is 50, then there is a profit of Rs 2 on each
of the 50 copies sold and a loss of Rs 3 on each of the remaining 10 copies: 50 × 2 − 10 × 3 = Rs 70.
If the no. of copies procured is 60 and demand is 60, then profit = 60 × 2 = Rs 120, and so on.
The maximum profit of 184.9 is obtained when the vendor stocks 70 copies of the
newspaper.
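The expected profit for each stocking level can be computed by weighting the profit under each demand level by its probability. The sketch below assumes the payoff rule described in the worked cases above (Rs 2 profit per copy sold, Rs 3 loss per unsold copy, no penalty for unmet demand); the function names are ours, and the rupee figures it prints depend on that assumed rule, so they need not coincide with the entries of the payoff table, which are not reproduced here.

```python
demand_probs = {50: 0.07, 60: 0.11, 70: 0.33, 80: 0.26, 90: 0.19, 100: 0.04}

PROFIT_PER_SOLD = 2    # Rs 5 selling price - Rs 3 procurement price
LOSS_PER_UNSOLD = 3    # procurement price of a copy that is not sold


def profit(stock, demand):
    """Profit for one day under the assumed payoff rule."""
    sold = min(stock, demand)
    unsold = max(stock - demand, 0)
    return PROFIT_PER_SOLD * sold - LOSS_PER_UNSOLD * unsold


def expected_profit(stock):
    return sum(profit(stock, d) * p for d, p in demand_probs.items())


for stock in demand_probs:
    print(stock, round(expected_profit(stock), 2))

best_stock = max(demand_probs, key=expected_profit)
print("best stocking level:", best_stock)
```

Under this rule the stocking level with the highest expected profit is 70 copies, the same decision as in the solution.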
Examples based on 5.4
5.11 The probability distribution of a random variable X is shown below, where X
represents the amount of money (in 1,000s) gained or lost in a particular game of
Rummy.
x | -4 0 4 8
ii. P(X>3)
iii. P(0<X<4)
iv. P(X=5)
b) Find the following values and indicate their units.
i. E(X)
ii. V(X)
iii. SD(X)
Solution:
a) i. 0.40 ii. 0.60 iii. 0.45 iv. 0.000
5.15 A street vendor has found that the demand for his idlis varies every day. Since
an idli stays fresh for only one day, he has to discard the unsold idlis. From his
experience, the probability distribution for demand is shared below. The cost of
preparing a plate is Rs 20 and he sells each plate for Rs 70. He has also noticed that, in
addition to the overstock cost, there is another cost of Rs 50 if a customer returns due to
insufficient stock. He needs your help to know how many idlis he should stock so that
he can minimize his obsolescence loss and opportunity loss.
If a random experiment results in only two mutually exclusive outcomes, e.g. success and
failure, then the random variable follows the Bernoulli distribution. If the experiment is
repeated n times and the probability of success p (0 < p < 1) is constant in each trial, then
such trials are known as Bernoulli trials. The probability of success is denoted by p and the
probability of failure by q = 1 − p.
0 = fail
A newborn child is either a girl or a boy | 1 = girl, 0 = boy | p = 0.5
The probability distribution for a random variable X that follows the Bernoulli distribution
with probability p is written as:
$$P(X = x) = p^{x}(1-p)^{1-x}, \quad x = 0, 1$$
The expected value and variance of the Bernoulli distribution are given below:
$$E(X) = p, \qquad V(X) = p(1-p) = pq$$
Example 5.5 A retailer feels that customers prefer credit cards over cash to purchase items
that are above Rs 1,000. The probability of buying items using a credit card was 70% for the
items worth Rs 1,000 or more. To justify his belief, he observes the purchasing pattern of
customers. Suppose 5 customers are standing in the queue to make payment.
Solution:
Each customer uses either a credit card or cash to make the payment, so there are only two
possible outcomes (success with probability 70% if he uses a credit card, failure with
probability 30% if he uses cash). The mode of payment of one customer is independent of
that of the others, and the probability of success and failure is constant. Hence, the
example satisfies the conditions of the Bernoulli process.
Customer 1 | Customer 2 | Customer 3 | Customer 4 | Events | No. of customers using credit card (x) | Prob
S | S | S | S | SSSS | 4 | 0.2401
S | S | S | F | SSSF | 3 | 0.1029
S | S | F | S | SSFS | 3 | 0.1029
S | S | F | F | SSFF | 2 | 0.0441
S | F | S | S | SFSS | 3 | 0.1029
S | F | S | F | SFSF | 2 | 0.0441
S | F | F | S | SFFS | 2 | 0.0441
S | F | F | F | SFFF | 1 | 0.0189
F | S | S | S | FSSS | 3 | 0.1029
F | S | S | F | FSSF | 2 | 0.0441
F | S | F | S | FSFS | 2 | 0.0441
F | S | F | F | FSFF | 1 | 0.0189
F | F | S | S | FFSS | 2 | 0.0441
F | F | S | F | FFSF | 1 | 0.0189
F | F | F | S | FFFS | 1 | 0.0189
F | F | F | F | FFFF | 0 | 0.0081
As our area of interest is not to identify the individual customer who uses a credit card, rather
our interest is to find the no. of customers who use a credit card, so we can combine the same
values of x.
x p(x)
0 0.0081
1 0.0756
2 0.2646
3 0.4116
4 0.2401
The above table represents a binomial distribution. In this way, counting the number of
successes in a Bernoulli process repeated n times gives rise to a binomial distribution.
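The combining step described above can be reproduced by listing every sequence of successes (S) and failures (F) and adding the probabilities of the sequences that share the same number of successes. A minimal sketch for n = 4 customers and p = 0.7 follows; the function name is ours.

```python
from collections import defaultdict
from itertools import product


def distribution_by_enumeration(n, p):
    """Add up the probabilities of all S/F sequences with the same count of S."""
    dist = defaultdict(float)
    for outcome in product("SF", repeat=n):
        successes = outcome.count("S")
        # independent trials: multiply p for each S and (1 - p) for each F
        dist[successes] += p ** successes * (1 - p) ** (n - successes)
    return dict(sorted(dist.items()))


credit_card = distribution_by_enumeration(4, 0.7)
for x, px in credit_card.items():
    print(x, round(px, 4))
```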
When a random experiment yields only two outcomes that are mutually exclusive and
collectively exhaustive, and the experiment is repeated n times independently, then the
random variable follows the binomial probability distribution. The probability of success is
denoted by p and the probability of failure by q = 1 − p. Suppose that the experiment
is repeated n times and we get success x times and failure the remaining n − x times.
Since we get x successes out of n trials, the total number of ways in which we can attain
x successes is $^{n}C_{x}$.
The probability distribution for a random variable X that follows the binomial distribution
with probability p is written as:
$$P(X = x) = {}^{n}C_{x}\, p^{x} q^{n-x}, \quad x = 0, 1, \ldots, n$$
1. Each trial results in only two outcomes which are mutually exclusive and collectively
exhaustive.
2. The number of trials 'n' is finite.
The binomial distribution is widely used for solving problems. Some examples where we
can apply the binomial distribution are: whether an item produced in a manufacturing plant
is defective or non-defective, whether a firm will obtain a tender or not, whether a
potential customer will buy the product or not, whether a student will pass the exam or
not, whether a candidate will obtain a job or not, etc.
The formulas shared in Section 5.3 are used to calculate the expected value, variance and
standard deviation for a binomial distribution. The formulas simplify to
E(X) = np
V(X) = npq
SD(X) = √(npq)
For a binomial distribution, variance < mean. If np is a whole number then the distribution is
unimodal, and mean = mode = np.
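The simplified formulas can be checked against the general definitions of expected value and variance. The sketch below uses `math.comb` from the Python standard library for the number of combinations; the parameter values n = 5 and p = 0.1 are chosen only for illustration.

```python
from math import comb, sqrt


def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 5, 0.1
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

mean = sum(x * px for x, px in enumerate(pmf))                     # general definition
variance = sum(x * x * px for x, px in enumerate(pmf)) - mean ** 2

print(mean, n * p)                      # both are E(X)
print(variance, n * p * (1 - p))        # both are V(X)
print(sqrt(variance))                   # SD(X)
```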
Let us examine the binomial distribution graphically when its parameters n and p change.
[Figure: binomial distributions for different values of n and p; panels include n=5, p=0.1; n=5, p=0.5; n=5, p=0.9]
From the graph, we can see that for the same value of sample size n, the shape of distribution
changes for different values of probability. If p < 0.5 the shape of the distribution is right-
skewed, for p=0.5 the shape of the distribution is symmetric, and for p > 0.5, the shape of the
distribution is left-skewed. Most of the probability is concentrated near the expected value.
For a given n, the variance npq is highest when p and q are equal.
Similarly, we find that if p is constant and n increases then the shape of distribution
approaches symmetry.
Example 5.6: In an industry there is a 30% chance that an accident occurs due to a chemical leak.
a) Construct the probability distribution.
b) What is the probability that out of 20 workers, 8 or more will suffer an injury due to
chemical leak?
c) What is the mean and variance of the number of accidents?
Solution:
Here n = 20, p = 0.30.
a) $P(X = x) = {}^{20}C_{x}\,(0.3)^{x}(0.7)^{20-x}$
$P(X = 0) = {}^{20}C_{0}\,(0.3)^{0}(0.7)^{20}$
= 1 × 1 × 0.000798
= 0.000798
$P(X = 1) = {}^{20}C_{1}\,(0.3)^{1}(0.7)^{19}$
= 20 × 0.3 × 0.00114
= 0.006839
Similarly substituting the values of x we get the following table
X P(X)
0 0.00079792266
1 0.00683933711
2 0.02784587252
3 0.07160367221
4 0.13042097437
5 0.17886305057
6 0.19163898275
7 0.16426198522
8 0.11439673970
9 0.06536956555
10 0.03081708090
11 0.01200665490
12 0.00385928193
13 0.00101783260
14 0.00021810699
15 0.00003738977
16 0.00000500756
17 0.00000050496
18 0.00000003607
19 0.00000000163
20 0.00000000003
b) P(X ≥ 8) = 1 − P(X ≤ 7)
= 1 − 0.772272
= 0.227728
c) E(X) = np = 6, V(X) = npq = 4.2
The probability of 8 or more workers suffering an injury due to chemical leak is 0.227728, and
on average 6 workers suffer an injury, with a variance of 4.2, due to chemical leak.
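The whole of Example 5.6 can be reproduced with a few lines of code. This is a sketch using only the standard library; the variable names are ours.

```python
from math import comb


def binomial_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 20, 0.30
table = {x: binomial_pmf(x, n, p) for x in range(n + 1)}   # part a

p_at_most_7 = sum(table[x] for x in range(8))
p_at_least_8 = 1 - p_at_most_7                             # part b

mean = n * p                                               # part c
variance = n * p * (1 - p)

print(round(table[0], 6), round(table[1], 6))
print(round(p_at_most_7, 6), round(p_at_least_8, 6))
print(mean, variance)
```

Using the complement 1 − P(X ≤ 7) means only eight terms have to be added instead of thirteen.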
In an R&D lab, there is a 19% chance of a radiation leak. What is the probability that out of
20 workers, 10 or fewer will suffer an injury due to radiation leak?
x | P(x)
0 0.0148
1 0.0693
2 0.1545
3 0.2175
4 0.2168
5 0.1627
6 0.0954
7 0.0448
8 0.0171
9 0.0053
10 0.0014
11 0.0003
12 0.0001
13 0.0000
14 | ---
15 | ---
16 | ---
17 | ---
18 | ---
19 | ---
20 | ---
Just adding the values from 0 to 10, answer the above question.
[Figure: Excel Function Arguments dialog for BINOM.DIST. The dialog returns the individual term binomial distribution probability. Cumulative is a logical value: for the cumulative distribution function, use TRUE; for the probability mass function, use FALSE.]
[Figure: P(X) and F(X) plotted for X = 0 to 6]
b) P(x>=4)
c) P(x<=12)
d) P(x<5)
5.17 For the parameters n=20, p=0.2, using the Binomial table estimate the following
probabilities.
a) P(x=6)
b) P(x>=12)
c) P(x<=10)
d) P(x>8)
5.18 Find the mean and standard deviation for the following binomial distributions.
a) n=10, p=0.2
b) n=50, p=0.45
c) n=82, p=0.06
d) n=300, p=0.25
5.19 For n=14, compute the probabilities of x>=3 for the following values of p.
a) p=0.15
b) p=0.25
c) p=0.35
d) p=0.45
5.20 In the Holiday Inn hotel, 40 percent of the customers pay by credit card.
a) Of the next 15 customers, what is the probability that all of them pay by cash?
d) Find the standard deviation.
e) Construct the probability distribution function.
week, no. of customers arriving in the bank per minute, etc. As the probability p of a
particular event happening is very low the Poisson distribution is also referred to as the
distribution of rare events or, the law of improbable events. The Poisson process measures
the number of occurrences of a specific outcome of a discrete random variable in a fixed time
interval, space or volume for which an average number of occurrences of an outcome is
known or can be estimated.
So the difference between Binomial and Poisson distribution is: A Binomial random variable
counts the number of successes in fixed Bernoulli trials whereas Poisson random variable
counts the number of successes over the fixed interval of time or space.
A rule of thumb says that for the approximation to be good:
“The sample size n should be equal to or larger than 20, i.e. (n ≥ 20), and the probability of a
single success, p, should be smaller than or equal to 0.05, i.e. (p ≤ 0.05). If n > 100 or np < 10
then the approximation is excellent.”
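The quality of the approximation can be inspected by computing both distributions side by side with λ = np. The sketch below uses n = 100 and p = 0.02, values we chose for illustration because they satisfy the rule of thumb; the function names are ours.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    return exp(-lam) * lam ** x / factorial(x)


n, p = 100, 0.02          # illustrative values with n >= 20 and p <= 0.05
lam = n * p               # the Poisson mean is matched to the binomial mean

gaps = []
for x in range(11):
    b, q = binomial_pmf(x, n, p), poisson_pmf(x, lam)
    gaps.append(abs(b - q))
    print(x, round(b, 4), round(q, 4))

max_gap = max(gaps)
print("largest difference:", round(max_gap, 4))
```

For these values the two probabilities differ only in the third decimal place, which is why the Poisson formula is often preferred when n is large and p is small.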
Conditions for Poisson Probability Distribution
To apply the Poisson distribution, the random variable should satisfy the following conditions:
1. The number of successes within a specified time or space interval equals any integer
between zero and infinity.
2. The numbers of successes counted in non-overlapping intervals occur randomly and
are independent of each other.
3. The probability that success occurs in any interval is the same for all intervals of equal
size and is proportional to the size of the interval.
4. The average number of occurrences is constant for all time intervals of the same size.
Examples of Poisson Distribution
• The number of cars that cross the Bandra-Worli Sea Link between 9 AM and 12 AM
on weekdays.
• The number of patients waiting at the OPD per hour.
The first two random variables follow a Poisson distribution concerning specific time and
third and fourth random variable follows a Poisson distribution concerning space.
The probability mass function for the Poisson distribution
For the Poisson random variable X, the probability of x successes over a given interval of time
or space is given by
$$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots$$
where λ is the mean number of successes and e = 2.718 is the base of the natural logarithm.
Expected Value, Variance and Standard Deviation of a Poisson random variable
$$E(X) = \mu = \lambda$$
$$V(X) = \sigma^2 = \lambda$$
$$SD(X) = \sigma = \sqrt{\lambda}$$
For the Poisson distribution, mean = variance. The simplicity of the Poisson formula makes it
an attractive model compared to the binomial. The Poisson distribution is a right-skewed
distribution, but as the value of λ increases, it moves towards symmetry.
[Figure: Poisson probability distributions for different values of λ, including λ = 2 and λ = 3.5; x-axis: X = 0, 1, 2, ..., y-axis: probability]
Range | 0, 1, 2, ..., ∞
Expected value E(X) | λ
Variance V(X) | λ
SD(X) | √λ
Shape | The shape of the distribution is right-skewed
Excel function | POISSON.DIST(x, λ, 0)
Example 5.7: Traffic police are planning to make some roads one-way to avoid accidents. As
per the records, on average there are 4 accidents per week at a particular intersection
on the Kapasera border.
$$P(X=0) = \frac{e^{-4}\,4^{0}}{0!} = e^{-4} = 0.0183$$
$$P(X=1) = \frac{e^{-4}\,4^{1}}{1!} = e^{-4} \times 4 = 0.0733$$
Similarly substitute different values of x till you get P(x) as 0.0000 as shown in the Table
below:
X P(X)
0 0.018316
1 0.073263
2 0.146525
3 0.195367
4 0.195367
5 0.156293
6 0.104196
7 0.05954
8 0.02977
9 0.013231
10 0.005292
11 0.001925
12 0.000642
13 0.000197
b) P(X = 0) = 0.018316
c) P(X ≥ 4) = 0.56645
d) P(X ≤ 3) = 0.43347
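The table and the answers of Example 5.7 can be reproduced from the Poisson formula with λ = 4. The sketch below uses only the standard library; the function name is ours.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam ** x / factorial(x)


lam = 4                                        # average accidents per week
table = {x: poisson_pmf(x, lam) for x in range(14)}

p_zero = table[0]
p_at_most_3 = sum(table[x] for x in range(4))
p_at_least_4 = 1 - p_at_most_3                 # complement of P(X <= 3)

for x, px in table.items():
    print(x, round(px, 6))
print(round(p_zero, 6), round(p_at_least_4, 5), round(p_at_most_3, 5))
```

The complement gives P(X ≥ 4) ≈ 0.56653, marginally above the value obtained by adding the tabulated probabilities from 4 to 13, because the table stops at X = 13.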
Let us look at another example. Suppose the average number of customers walking into a bank
for some service is 3 per hour.
Solution:
a) From the Poisson probability distribution table, Appendix 2 we get the following
values
X | P(X)
0 | 0.0498
1 | 0.1494
2 | 0.2240
3 | 0.2240
4 | 0.1680
5 | 0.1008
6 | 0.0504
7 | 0.0216
8 | 0.0081
9 | 0.0027
10 | 0.0008
11 | 0.0002
12 | 0.0001
13 | 0.0000
Poisson Function in Excel (Formulas > More functions > Statistical > POISSON.DIST)
The function requires 3 arguments to be filled in the dialog box shown below in Fig. 5.8
[Figure 5.8: Excel Function Arguments dialog for POISSON.DIST with the arguments X, Mean (3) and Cumulative. The dialog returns the Poisson distribution. Cumulative is a logical value: for the cumulative Poisson probability, use TRUE; for the Poisson probability mass function, use FALSE.]
[Figures: P(X) and cumulative probability plotted for X = 0 to 13]
b) P(X≥4)
c) P(X≤2)
d) P(X>3.5)
e) P(1<X<4)
5.22 Given a binomial distribution with n=30 and p=0.015, use a Poisson approximation to
the binomial to find
a) P(X>5)
b) P(X<8)
c) P(2<X<9)
d) P(X≥3)
5.23 A banker is interested to know how many credit card applications are processed at his
bank. On average, 5 credit card applications are processed in a week. What is the
probability that at most 8 credit card applications would be processed in a fortnight?
5.24 An experienced waiter at Hyatt Hotel has a 0.3% chance of making an error while
taking an order. If he takes 500 orders, find the probability
a) Of at least 3 errors
b) Fewer than 6 errors
5.7 LET US SUM UP
PMF of Poisson distribution: $$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}$$
Mean of Poisson distribution = λ
5.8 KEYWORDS
Bernoulli Process: A process that results in two outcomes, in which the probability of success
remains constant and the trials are independent of each other.
Continuous Random Variable: A random variable that can take any value within a specified
range.
Akshay & Mahesh have different types of responsibilities. From secondary data, they figured
out the probability of a profit by the company. Thinking about their risk, both construct their
probability distributions concerning bonus outcomes shown in the table below.
Table 5.11:
Bonus Probability
0 0.35 0.20
a) Compute the expected values to evaluate payment plans for Akshay and Mahesh.
b) Help Akshay and Mahesh to decide whether to choose Option 1 or Option 2 for
their salary.
Q.1 For each of the following random variables, indicate whether the variable is discrete or
continuous and specify the possible values that it can assume.
a. X = the number of wrong calls received on a given day.
b. X = the amount of money lost in a month by a randomly selected gambler.
c. X = the average number of customers in a shop in an hour.
Solution:
a) discrete; x = 0, 1, 2, 3, ... b) continuous; −∞ < x < ∞
e) continuous; x > 0
Q.2 The random variable X represents the number of farms per family in a rural area of
Punjab, with the probability distribution: p(x) = 0.05x, x = 2, 3, 4, 5, or 6.
b) Find the expected number of farms per family, find the variance and standard
deviation of X.
c) Find the following probabilities:
i. P(X>4)
ii. P(X>4)
iii. P(3<X>5)
iv. P(2<X<4)
v. P(X=4.5)
Solution:
a)
xX |2 3 4 5 6
Q.3 Let X represent the number of times a student visits a club in one month. Assume
that the probability distribution of X is as follows:
X | 0 1 2 3
c) What is the probability that the student visits the club at least once in a month?
d) What is the probability that the student visits the club no more than twice in a
month?
Solution:
d) P(0) + P(1) + P(2) = 0.80
Q.4 At DLF Mall the probability distribution of the number of stores shoppers enter is
shown in the table below:
x | 0 1 2 3 4
p(x) | 0.05 0.35 0.25 0.20 0.15
a) Find the expected value of the number of stores entered, find the variance and
standard deviation of the number of stores entered.
d) Use the laws of expected value to calculate the mean of Y from the probability
distribution of X.
e) Calculate the variance and standard deviation of Y directly from the probability
distribution of Y.
f) Use the laws of variance to calculate the variance and standard deviation ofY from the
probability distribution of X.
g) What did you notice about the mean, variance, and standard deviation of Y= 2X +1in
terms of the mean, variance, and standard deviation of X?
Solution:
a) E(X) = 2.05, V(X) = 1.3475, SD(X) = 1.1688
b)
Y | 1 3 5 7 9
E(Y) = 5.10
E(Y) = E(2X + 1) = 2E(X) + 1 = 2(2.05) + 1 = 5.10
$\sigma_Y^2 = 5.39$ and $\sigma_Y = 2.3216$
Q.5 The joint probability distribution of variables X and Y is shown in the table below.
Aviral and Shivani have joined an automobile factory and are in their training period.
Let X denote the number of cars that Aviral will pitch in a month, and let Y denote the
number of cars Shivani will pitch in a month.
X
Y 1 2 3
ANS:
x | 1 2 3
y | 1 2 3
x+y | 2 3 4 5 6
P(x+y) | 0.30 0.33 0.26 0.09 0.02
E(X+Y)=3.20
V(X+Y)= 1.06
E(X) + E(Y) =1.70+1.50=3.2=E(X+Y)
V(X) + V(Y) = 0.61 + 0.45 = 1.06 = V(X + Y). Yes, since X and Y are independent random
variables.
Q.6 Two balanced dice are rolled simultaneously. Let X be the sum of the two dice.
X p(x)
50,000 0.1
60,000 0.25
70,000 0.4
80,000 0.2
90,000 0.05
Q.8 A survey was conducted to find out how many credit cards a person carries.
0 0.05
1 0.35
2 0.25
3 0.2
>=4 0.15
Q.9 If a football game is a tie, to decide the winner of the match each team receives an
opportunity of 5 penalty goals. Based on the records the opposing coach believes that
the chances of conversion of all 5 goals are 0.35, chances of conversion of 4 goals are
0.30 chances of conversion of 3 goals, are 0.20, chances of conversion of 2 goals are
0.15, chances of conversion of 1 or fewer goals is 0.
a) If 20 patients are admitted to the hospital, what is the probability of no deaths?
b) Probability of 5 or fewer deaths.
Q.12 In the Burger King outlet, half of the customers order vegetarian burgers.
a) What is the probability that none of the next 5 customers will order a vegetarian
burger?
Q.15 According to the National Cancer Registry Programme of the Indian Council of Medical
Research (ICMR), more than 1300 Indians die every day due to cancer.
a) What is the probability that at least 1100 people die due to cancer every day?
b) What is the probability that at most 800 people die due to cancer every day?
THE NORMAL DISTRIBUTION
AND OTHER CONTINUOUS DISTRIBUTIONS
STRUCTURE
6.0 Objectives
6.1 Introduction
6.2 Continuous Distribution
6.3 Normal Distribution
6.4 Evaluating Normality
6.5 The Uniform Distribution
6.6 The Exponential Distribution
6.7 The Normal Approximation to the Binomial Distribution
6.8 Let Us Sum Up
6.9 Self-Assessment Questions
6.10 Answers to Self-Assessment Questions
6.0 OBJECTIVES
6.1 INTRODUCTION
This chapter will introduce you to three continuous distributions: the uniform distribution,
the normal distribution, and the exponential distribution. The text has been prepared keeping
in mind that you should be able to use it to solve different practical problems related
to continuous curves. Examples and practice problems are provided to
supplement your knowledge. The chapter also includes some exercises.
A random variable (x) is said to have a continuous probability distribution if its probability
distribution is defined over all the values of x within a specified interval. The normal
distribution is the best-known and most commonly used example.
It is useful to work with such theoretical probability distributions because their properties
and characteristics are well known.
PROBABILITY DISTRIBUTIONS OF CONTINUOUS VARIABLES
A continuous random variable can take every possible value in a given interval, and a
probability space is defined on it. It is not possible to enumerate all the points between any
two possible values of a continuous variable. Therefore, calculus is used to find probabilities
as areas under the curve, without measuring each point individually.
6.3 NORMAL DISTRIBUTION
The normal distribution is a type of continuous distribution and the most commonly used
distribution in statistics. Many real-life variables follow the characteristics of a normal
distribution, such as weight, height, length, speed, etc. Characteristics of a normal
distribution are also identified in living things, such as trees, animals, insects, etc.
Different combinations of the parameters mean μ and variance σ² create many normal
distributions with the same basic shape. Therefore, in most business situations, values are
converted to a standard normal distribution instead of working with each normal
distribution separately.
A standard normal distribution has a mean of 0 and a standard deviation of 1. The standard
normal variable is obtained by subtracting the mean from the original values and
dividing by the standard deviation.
Irrespective of mean and variance, any normal distribution has the following characteristics:
a. Distribution is symmetric
b. Distribution is uni-modal
c. Distribution has a continuous range from −∞ to +∞
As mentioned earlier, there can be many normal distributions with the characteristics
given above, so the standard normal variable is applied in most business situations
instead.
The methodology of this transformation is described below.
The second point which needs to be understood is that the probability that X is exactly equal
to some value is always zero, because the area under the curve at a single point,
which has no width, is zero. We can calculate a non-zero probability that a man weighs more
or less than a fixed amount, but the probability that his weight is exactly equal to a given
value is zero in a continuous distribution, because the variable can take infinitely many values.
The normal distribution is described or characterized by two parameters: the mean, μ, and
the standard deviation, σ. The values of these determine a particular normal distribution.
The density function of the normal distribution is given as:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
where
μ = mean of x
σ = standard deviation of x
π = 3.14159..., and e = 2.71828...
We can calculate this density function for different combinations of μ and σ. It can be
observed that different combinations give different normal distributions. Thus, to
obtain normal probabilities, we use the standardized normal variable Z, obtained by converting
a normally distributed variable. Using this transformation and a normal probability table, we
can avoid the tedious computations that appear in the density function above.
The standardized normal variable Z can be computed as:
$$Z = \frac{x - \mu}{\sigma}$$
We can illustrate the transformation using z (Fig. 6.2), where μ = 8 and σ = 2.
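The density function and the standardizing transformation can be sketched in code. The example below assumes a normal variable with mean 8 and standard deviation 2; the function names `normal_pdf` and `z_score` are ours. It also shows that the density of X at x equals the standard normal density at z divided by σ, which is why one table for Z is sufficient.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def z_score(x, mu, sigma):
    """Number of standard deviations by which x lies above (+) or below (-) mu."""
    return (x - mu) / sigma


mu, sigma = 8, 2   # assumed parameters for the illustration

for x in (4, 6, 8, 10, 12):
    z = z_score(x, mu, sigma)
    # density of X at x versus standard normal density at z, rescaled by sigma
    print(x, z, round(normal_pdf(x, mu, sigma), 5), round(normal_pdf(z, 0, 1) / sigma, 5))
```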
Example:
Suppose that in the final exam of a statistics class, students have a mean score of 70 and a
standard deviation of 20. Calculate the probability that a student selected randomly from
the class has scored more than 80.
Solution
First, convert the variable X into the standard variable Z. The z score for this
problem is calculated as follows:
$$Z = \frac{x - \mu}{\sigma} = \frac{80 - 70}{20} = \frac{10}{20} = 0.5$$
This gives a new variable whose value denotes the number of standard deviations
by which the original value (80) lies from the mean. From the Z-table below
(intersection of 0.5 in the first column and 0.00 in the top row), the value is 0.1915, which
is the area between the mean and the point in question, i.e. X = 80. We have obtained the
probability between the mean (70) and 80, but to find the probability of a score greater
than 80 we need to subtract this area from 0.5 (one half of the curve is 0.5). Therefore, the
probability of X ≥ 80 is 0.5 − 0.1915 = 0.3085. If we need the probability of a score less
than 80, this value has to be subtracted from 1, which comes to 0.6915.
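The table lookup in this example can be cross-checked with the error function `math.erf` from the Python standard library, since the standard normal cumulative probability is Φ(z) = 0.5 × (1 + erf(z/√2)). This is only a sketch for verification; the function name `standard_normal_cdf` is ours.

```python
from math import erf, sqrt


def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))


mu, sigma, x = 70, 20, 80
z = (x - mu) / sigma

area_mean_to_x = standard_normal_cdf(z) - 0.5   # the quantity listed in the Z-table
p_greater = 0.5 - area_mean_to_x                # P(X > 80)
p_less = 1 - p_greater                          # P(X < 80)

print(z, round(area_mean_to_x, 4), round(p_greater, 4), round(p_less, 4))
```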
z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09
0.0 0.0000 | 0.0040 | 0.0080 | 0.0120 | 0.0160 | 0.0199 | 0.0239 | 0.0279 | 0.0319 | 0.0359
0.1 0.0398 | 0.0438 | 0.0478 | 0.0517 | 0.0557 | 0.0596 | 0.0636 | 0.0675 | 0.0714 | 0.0753
0.2 0.0793 | 0.0832 | 0.0871 | 0.0910 | 0.0948 | 0.0987 | 0.1026 | 0.1064 | 0.1103 | 0.1141
0.3 0.1179 | 0.1217 | 0.1255 | 0.1293 | 0.1331 | 0.1368 | 0.1406 | 0.1443 | 0.1480 | 0.1517
0.4 0.1554 | 0.1591 | 0.1628 | 0.1664 | 0.1700 | 0.1736 | 0.1772 | 0.1808 | 0.1844 | 0.1879
0.5 0.1915 | 0.1950 | 0.1985 | 0.2019 | 0.2054 | 0.2088 | 0.2123 | 0.2157 | 0.2190 | 0.2224
0.6 0.2257 | 0.2291 | 0.2324 | 0.2357 | 0.2389 | 0.2422 | 0.2454 | 0.2486 | 0.2517 | 0.2549
0.7 0.2580 | 0.2611 | 0.2642 | 0.2673 | 0.2704 | 0.2734 | 0.2764 | 0.2794 | 0.2823 | 0.2852
0.8 0.2881 | 0.2910 | 0.2939 | 0.2967 | 0.2995 | 0.3023 | 0.3051 | 0.3078 | 0.3106 | 0.3133
0.9 0.3159 | 0.3186 | 0.3212 | 0.3238 | 0.3264 | 0.3289 | 0.3315 | 0.3340 | 0.3365 | 0.3389
1.0 0.3413 | 0.3438 | 0.3461 | 0.3485 | 0.3508 | 0.3531 | 0.3554 | 0.3577 | 0.3599 | 0.3621
1.1 0.3643 | 0.3665 | 0.3686 | 0.3708 | 0.3729 | 0.3749 | 0.3770 | 0.3790 | 0.3810 | 0.3830
1.2 0.3849 | 0.3869 | 0.3888 | 0.3907 | 0.3925 | 0.3944 | 0.3962 | 0.3980 | 0.3997 | 0.4015
1.3 0.4032 | 0.4049 | 0.4066 | 0.4082 | 0.4099 | 0.4115 | 0.4131 | 0.4147 | 0.4162 | 0.4177
1.4 0.4192 | 0.4207 | 0.4222 | 0.4236 | 0.4251 | 0.4265 | 0.4279 | 0.4292 | 0.4306 | 0.4319
1.5 0.4332 | 0.4345 | 0.4357 | 0.4370 | 0.4382 | 0.4394 | 0.4406 | 0.4418 | 0.4429 | 0.4441
1.6 0.4452 | 0.4463 | 0.4474 | 0.4484 | 0.4495 | 0.4505 | 0.4515 | 0.4525 | 0.4535 | 0.4545
1.7 0.4554 | 0.4564 | 0.4573 | 0.4582 | 0.4591 | 0.4599 | 0.4608 | 0.4616 | 0.4625 | 0.4633
1.8 0.4641 | 0.4649 | 0.4656 | 0.4664 | 0.4671 | 0.4678 | 0.4686 | 0.4693 | 0.4699 | 0.4706
1.9 0.4713 | 0.4719 | 0.4726 | 0.4732 | 0.4738 | 0.4744 | 0.4750 | 0.4756 | 0.4761 | 0.4767
2.0 0.4772 | 0.4778 | 0.4783 | 0.4788 | 0.4793 | 0.4798 | 0.4803 | 0.4808 | 0.4812 | 0.4817
2.1 0.4821 | 0.4826 | 0.4830 | 0.4834 | 0.4838 | 0.4842 | 0.4846 | 0.4850 | 0.4854 | 0.4857
2.2 0.4861 | 0.4864 | 0.4868 | 0.4871 | 0.4875 | 0.4878 | 0.4881 | 0.4884 | 0.4887 | 0.4890
2.3 0.4893 | 0.4896 | 0.4898 | 0.4901 | 0.4904 | 0.4906 | 0.4909 | 0.4911 | 0.4913 | 0.4916
2.4 0.4918 | 0.4920 | 0.4922 | 0.4925 | 0.4927 | 0.4929 | 0.4931 | 0.4932 | 0.4934 | 0.4936
2.5 0.4938 | 0.4940 | 0.4941 | 0.4943 | 0.4945 | 0.4946 | 0.4948 | 0.4949 | 0.4951 | 0.4952
2.6 0.4953 | 0.4955 | 0.4956 | 0.4957 | 0.4959 | 0.4960 | 0.4961 | 0.4962 | 0.4963 | 0.4964
2.7 0.4965 | 0.4966 | 0.4967 | 0.4968 | 0.4969 | 0.4970 | 0.4971 | 0.4972 | 0.4973 | 0.4974
2.8 0.4974 | 0.4975 | 0.4976 | 0.4977 | 0.4977 | 0.4978 | 0.4979 | 0.4979 | 0.4980 | 0.4981
2.9 0.4981 | 0.4982 | 0.4982 | 0.4983 | 0.4984 | 0.4984 | 0.4985 | 0.4985 | 0.4986 | 0.4986
3.0 0.4987 | 0.4987 | 0.4987 | 0.4988 | 0.4988 | 0.4989 | 0.4989 | 0.4989 | 0.4990 | 0.4990
3.1 0.4990 | 0.4991 | 0.4991 | 0.4991 | 0.4992 | 0.4992 | 0.4992 | 0.4992 | 0.4993 | 0.4993
3.2 0.4993 | 0.4993 | 0.4994 | 0.4994 | 0.4994 | 0.4994 | 0.4994 | 0.4995 | 0.4995 | 0.4995
3.3 0.4995 | 0.4995 | 0.4995 | 0.4996 | 0.4996 | 0.4996 | 0.4996 | 0.4996 | 0.4996 | 0.4997
3.4 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4998
3.5 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998
3.6 0.4998 | 0.4998 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999
3.7 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999
3.8 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999
3.9 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000
For normally distributed data, we can use Excel to calculate the mean and the standard
deviation. Once the mean and the standard deviation are available, we can use the function
NORM.DIST to calculate the cumulative probability for a given value. This function is used in
the following format: NORM.DIST(x value, mean, standard deviation, TRUE).
OPMCO001
Business Statistics
Mean = 70
S.D. = 20
NORM.DIST
Standard_dev = 20
Cumulative = TRUE
= 0.691462461
Returns the normal distribution for the specified mean and standard deviation.
Cumulative is a logical value: for the cumulative distribution function, use TRUE; for
the probability density function, use FALSE.
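The same calculation can be sketched outside Excel. The snippet below is a minimal illustration in Python, assuming the standard-library `statistics.NormalDist` class as a stand-in for NORM.DIST; the variable names are chosen here for illustration only.

```python
from statistics import NormalDist

# Exam scores from the example: mean 70, standard deviation 20.
scores = NormalDist(mu=70, sigma=20)

# Cumulative probability P(X <= 80); this mirrors NORM.DIST(80, 70, 20, TRUE).
p_below_80 = scores.cdf(80)

# The upper-tail probability P(X > 80) is the complement.
p_above_80 = 1 - p_below_80

print(round(p_below_80, 4))  # 0.6915
print(round(p_above_80, 4))  # 0.3085
```

The two printed values agree with the z-table approach: 0.5 + 0.1915 for the lower area and 0.5 − 0.1915 for the upper tail.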
where:
$a < x_1 < x_2 < b$
$\text{Mean} = \frac{a + b}{2}$
$\text{Standard Deviation} = \frac{b - a}{\sqrt{12}}$
Example:
Suppose it takes between 20 and 40 minutes to complete a particular process and the process
time is uniformly distributed. Calculate the probability that the process takes 25 to 30 minutes.
With a = 20, b = 40, x1 = 25 and x2 = 30, the probability is (x2 − x1)/(b − a) = (30 − 25)/(40 − 20) = 0.25.
Hence, the chance that the process will be completed in 25 to 30 minutes is 25%.
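The uniform-distribution example can be checked with a few lines of Python. This is only a sketch; the helper name `uniform_probability` is hypothetical and not part of any library.

```python
def uniform_probability(x1, x2, a, b):
    """P(x1 <= X <= x2) for X uniformly distributed on the interval [a, b]."""
    if not a <= x1 <= x2 <= b:
        raise ValueError("expected a <= x1 <= x2 <= b")
    return (x2 - x1) / (b - a)

a, b = 20, 40                      # process time ranges from 20 to 40 minutes
p = uniform_probability(25, 30, a, b)
mean = (a + b) / 2                 # mean of the uniform distribution
std_dev = (b - a) / 12 ** 0.5      # standard deviation of the uniform distribution

print(p)                   # 0.25
print(mean)                # 30.0
print(round(std_dev, 4))   # 5.7735
```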
It is one of the important distributions among all the continuous probability distributions. To
understand the distribution of the times between random occurrences, we utilize
exponential probability distribution. Some of the characteristics of this distribution are:
• The exponential distribution is continuous and is a family of distributions.
Suppose arrivals at a ticket counter are Poisson distributed with 3 customers every minute.
What is the probability that an interval of 2 minutes or more elapses between arrivals?
Solution:
λ is always the reciprocal of the mean time between arrivals; in this case λ = 3 customers per
minute, so μ = 1/λ = 1/3 minute.
In the instance above, the probability is calculated as the sum of all probabilities of x > 2,
which translates to ½ customer per minute, and the problem to probabilities of fewer than
½ customer per minute.
For a given Poisson distributed data, we can use Excel to calculate the probability of time
elapsed between two arrivals. This function can be used as per the following format.
X = 2
Cumulative = TRUE
= 0.997521248
Returns the exponential distribution.
Cumulative is a logical value for the function to return: the cumulative distribution
function = TRUE; the probability density function = FALSE.
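As a rough cross-check of the Excel result, the cumulative exponential probability can be computed directly from 1 − e^(−λx). The sketch below assumes λ = 3 arrivals per minute and x = 2 minutes, as in the example; the function name `expon_cdf` is hypothetical.

```python
import math

def expon_cdf(x, lam):
    """Cumulative exponential probability P(X <= x) for rate lam."""
    return 1 - math.exp(-lam * x)

lam = 3                             # average arrivals per minute
p_within_2 = expon_cdf(2, lam)      # the cumulative value shown in the Excel output
p_2_or_more = 1 - p_within_2        # probability of waiting 2 minutes or more

print(round(p_within_2, 9))   # 0.997521248
print(round(p_2_or_more, 4))  # 0.0025
```

The cumulative value is the probability that the gap is less than 2 minutes, so the answer to the question asked (2 minutes or more) is its complement, about 0.25%.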
where,
x = 0, 1, 2, 3, 4, ...
p = Probability of success in a single experiment
q = Probability of failure in a single experiment = 1 − p
The binomial distribution formula can also be written in the form of n Bernoulli trials, where
$^nC_x = \frac{n!}{x!\,(n-x)!}$
size) the normal distribution is a good approximation for a binomial distribution. A rule of
thumb states that if n·p and n·(1 − p) are both greater than or equal to 10, the normal
distribution is appropriate.
Approximating a binomial distribution with the normal curve requires converting the two
parameters of the binomial distribution, n and p, to the parameters of the normal distribution,
μ and σ, as given below:
$\mu = n \cdot p$ and $\sigma = \sqrt{n \cdot p \cdot q}$, where $q = 1 - p$
Once the new parameters are obtained, there is a test to determine whether the normal
distribution is an acceptable approximation of the binomial distribution. By the empirical
rule, about 99.7% of the values of a normal distribution lie within μ ± 3σ. If this interval
falls between 0 and n, the range of all possible x values, the normal approximation of the
binomial distribution is acceptable.
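The conversion and the μ ± 3σ check described above can be sketched as follows. The function names and the two trial cases (n = 60, p = 0.3 and n = 10, p = 0.1) are illustrative assumptions, not values taken from the text.

```python
import math

def normal_approx_params(n, p):
    """Convert binomial parameters (n, p) to normal parameters (mu, sigma)."""
    q = 1 - p
    mu = n * p
    sigma = math.sqrt(n * p * q)
    return mu, sigma

def approximation_ok(n, p):
    """Check that the interval mu +/- 3*sigma lies between 0 and n."""
    mu, sigma = normal_approx_params(n, p)
    return 0 <= mu - 3 * sigma and mu + 3 * sigma <= n

mu, sigma = normal_approx_params(60, 0.3)
print(round(mu, 2), round(sigma, 4))   # 18.0 3.5496
print(approximation_ok(60, 0.3))       # True
print(approximation_ok(10, 0.1))       # False
```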
Suppose:
6.8 LET US SUM UP
The chapter included important formulas, concepts, distributions, and solutions of
problems, including the use of Excel worksheets. Starting with the continuous random variable
and its distribution, the chapter describes the Normal, Uniform, and Exponential Distributions
and the Normal Approximation to the Binomial Distribution, along with suitable examples. In
this chapter, we have discussed the application of standard normal random variables and how to
determine the mean and standard deviation of a normal random variable. We also covered the
real-world applications of the Normal, Uniform and Exponential Distributions. The calculation
and use of the Normal Approximation to the Binomial Distribution are also discussed.
Q.2 Which of the following statements is true for the probability distribution of a random variable?
(a) A probability is distributed among a range of possible values.
(b) The sum of all probabilities of all outcomes is 1.
(c) In any case no probability value occurs more than once.
(d) (a) and (b)
Q.3 Select the statement that matches the given situation. Suppose p = 0.3. The
binomial expression $\frac{6!}{3!\,3!}(0.3)^3(0.7)^3$ means:
(a) Probability of getting exactly three successes in six trials.
(b) Probability of getting exactly four successes in six trials.
(c) Probability of getting three or more successes in seven trials.
(d) Probability of getting four or more successes in seven trials.
Q.4 A random variable x is uniformly distributed over the interval 9 to 15. The height of
the distribution is
a) 1/9
b) 1/6
c) 1/15
d) 1/24
Q.5 What is the mean of the distribution of x when it is uniformly distributed over
the interval 9 to 15?
a) 12
b) 15
c) 9
d) 6
6.10 ANSWERS TO SELF-ASSESSMENT QUESTIONS
Q.1 c
Q.2 d
Q3 a
Q.4 b
Q5 a
SAMPLING DISTRIBUTIONS
STRUCTURE
7.0 Objectives
7.1 Introduction
7.2 Sampling Distribution
7.3 Sampling Distribution of the Mean
7.4 Sampling Distribution of the Proportion
7.5 Determining Sample Size
7.6 Let Us Sum Up
7.7 Self-Assessment Questions
7.8 Answers to Self-Assessment Questions
7.0 OBJECTIVES
• recognize the distribution of a sample's mean using the central limit theorem
• calculate the z-value for the distribution of a sample's mean and sample proportions
• estimate a suitable sample size in business situations
7.1 INTRODUCTION
This chapter will introduce you to concepts of the sampling distribution. The ultimate
objective is to enable the student to conceive the central limit theorem, distribution of a
sample’s mean, and sample proportions. Examples and practice problems are provided at
the end for solving. The chapter will include some exercises, which will allow you to calculate
the distribution of a sample’s mean, sample proportions, and appropriate sample size in
given situations.
If we draw all possible samples of size n from a given population and we compute a statistic
(e.g., a mean, proportion, standard deviation) for each sample, then the probability
distribution of this statistic is called a sampling distribution.
If we select a single random sample of a predetermined size from the population and use the
sample statistic to estimate the population parameter, there is a certain probability of
obtaining the same value as when the entire population is evaluated. This distribution of
probabilities among all possible values the statistic may take when computed from random
samples of the same size, drawn from a specified population, is called the sampling
distribution.
The sample mean is a random variable, and its values depend on the values of the
elements in the random sample from which it is computed and on the distribution of the
population from which the sample is drawn. A sample mean has a probability distribution.
Example:
The sample mean is one of the more common statistics used in the inferential process. Let us
try to derive the sampling distribution in the simple instance of drawing a sample of size n=2
items from a population uniformly distributed over the integers 1 to 6.
Using an Excel-produced histogram, we can see the shape of the distribution of this
population of data.
[Figure: histogram of the population; x-axis: Value of X (1 to 6); y-axis: Frequency]
Table 7.1: Possible Values of Two Sample Points from a Uniform Population
of the Integers 1 to 6
[Figure: histogram of the sample means; x-axis: Sample Means (1, 1.5, 2, 2.5, ..., 5.5)]
We can see that the shape of the population is quite different from the shape of the
distribution of the sample means.
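The sampling distribution in this example can be enumerated directly. The sketch below assumes sampling with replacement, which gives 6 × 6 = 36 equally likely ordered samples of size 2; that assumption is made here for illustration and is not stated in the surviving text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = range(1, 7)  # integers 1 to 6, each equally likely

# Every ordered sample of size n = 2 (with replacement) and its mean.
means = [Fraction(a + b, 2) for a, b in product(population, repeat=2)]
counts = Counter(means)

# Frequency of each possible sample mean, smallest to largest.
for m in sorted(counts):
    print(float(m), counts[m])
```

The population is flat, while the frequencies of the sample means rise from 1 (for a mean of 1) to 6 (for a mean of 3.5) and fall back to 1 (for a mean of 6), which is the triangular shape the histogram of sample means shows.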
Example:
Let us consider the example discussed above under the normal distribution. In the final exam
of a statistics class, the students got a mean score of 70 with a standard deviation of 20. If we
randomly select a group of 16 students from the class, what is the probability that the
average score of the selected students is greater than 80?
$P(\bar{X} > 80) = ?$
Solution
This problem calls for determining the area of the upper tail of the distribution. The z-score
for this problem is
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} = \frac{80 - 70}{20 / \sqrt{16}} = \frac{10}{5} = 2$
Now, the standard deviation of the sample mean is different from that of the population. The
standard deviation of the sample mean is the standard deviation of the population divided by
the square root of the sample size.
This “z” value is different from that derived from the entire population because the standard
deviation is divided by the square root of the sample size.
Using the z table, we arrive at a probability of 0.4772 (intersection of 2.0 and 0.00). This is the
probability of obtaining a sample mean between 70 and 80. To derive the probability greater
than 80, we need to subtract the probability value of 0.4772 from 0.5000 (as before), and thus
the probability that the sample mean is greater than 80 is 0.0228. Hence, there is around a 2%
probability that the average score of the selected students is greater than 80.
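The worked example can be reproduced in a few lines. This is a minimal sketch using Python's standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 70, 20, 16          # population mean, population s.d., sample size

standard_error = sigma / sqrt(n)   # s.d. of the sample mean: 20 / 4 = 5
z = (80 - mu) / standard_error     # z-score for a sample mean of 80

# Upper-tail area of the standard normal curve beyond z.
p_above = 1 - NormalDist().cdf(z)

print(standard_error, z)   # 5.0 2.0
print(round(p_above, 4))   # 0.0228
```

Note how dividing by the square root of n shrinks the spread: for a single student the same score of 80 corresponds to z = 0.5, not z = 2.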
As the sample size increases, the sampling distribution of the mean tends to the normal
distribution. This is known as the central limit theorem. It holds whatever the shape of the
population distribution, and it is one of the most important rules in statistics.
If samples of size n are drawn randomly from a population that has a mean μ and a standard
deviation σ, the sample means x̄ are approximately normally distributed for sufficiently large
sample sizes (n > 30). If the population is normally distributed, the sample means are
normally distributed for any sample size.
statistic of choice. The sampling distribution of the sample proportion is based on the
binomial distribution with parameters n and p, where n is the sample size and p is the
population proportion.
As the sample size n increases, the sampling distribution of the sample proportion p̂, as per
the Central Limit Theorem, approaches a normal distribution with mean p and standard deviation
$\sqrt{p(1 - p)/n}$.
The sample proportion is computed by dividing the frequency with which a given
characteristic occurs in a sample by the number of items in the sample.
$\hat{p} = x / n$
where
x = number of items in the sample that have the characteristic
n = number of items in the sample
Example:
A recent survey in ABC Company suggests that 80% of employees consider salary as the most
important component of job satisfaction. Suppose a random sample of 100 employees is
selected. What is the probability that more than 90% of the employees in the sample consider
salary as the most important component of job satisfaction?
Answers:
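One way to work this example is to apply the mean p and the standard deviation √(p(1 − p)/n) of the sample proportion given above. The following is a sketch under that approach, using Python's standard library; it is not reproduced from the original solution.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.80, 100                    # population proportion and sample size

std_error = sqrt(p * (1 - p) / n)   # s.d. of the sample proportion, about 0.04
z = (0.90 - p) / std_error          # z-score for a sample proportion of 0.90

# Upper-tail probability that the sample proportion exceeds 0.90.
p_above = 1 - NormalDist().cdf(z)

print(round(std_error, 4), round(z, 2))  # 0.04 2.5
print(round(p_above, 4))                 # 0.0062
```

Under these assumptions there is well under a 1% chance that more than 90% of the sampled employees give that answer.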
In case the population is very large and difficult to compute, we study a portion of the
population. It is unrealistic to study the entire population in most situations due to economic
constraints, time constraints, and other limitations. Two important examples are predicting
election results and market surveys conducted by companies. The issue here is: how large
should a sample be? To estimate the sample size in such situations, we must know the
answers to the following three questions:
1. How much error can we tolerate?
$n = \left(\frac{z\,\sigma}{E}\right)^2$
Example
Suppose we want to estimate the average income of people in a particular region where
income ranges from $1,000 to $10,000. We want to be 95% confident that the estimate from the
sample is close to the actual figure. If we can tolerate an error of ±$100, what should the
sample size be to estimate the average income?
Solution
Here, E = $100, the required confidence level is 95%, which gives a z value of 1.96, and σ is
unknown, so we can estimate it as (1/4)(range). In this case the range is $1,000 to $10,000,
therefore the estimate for σ = (1/4)(9,000) = 2,250.
$n = \left(\frac{z\,\sigma}{E}\right)^2 = \left(\frac{1.96 \times 2{,}250}{100}\right)^2 = 1{,}944.81$
Now, we have from the above calculation, estimated that we need to have a sample size of
around 1945.
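The sample-size calculation can be sketched as a small function. The helper name `sample_size_for_mean` is hypothetical; the result is rounded up because a sample size must be a whole number.

```python
import math

def sample_size_for_mean(z, sigma, error):
    """Smallest n such that z * sigma / sqrt(n) does not exceed the tolerated error."""
    return math.ceil((z * sigma / error) ** 2)

# Sigma is unknown, so it is estimated as one quarter of the range.
sigma_estimate = (10_000 - 1_000) / 4

n = sample_size_for_mean(z=1.96, sigma=sigma_estimate, error=100)

print(sigma_estimate)  # 2250.0
print(n)               # 1945
```

Halving the tolerated error to $50 would roughly quadruple the required sample, since the error term is squared in the denominator.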
$CI = \hat{p} \pm z \times \sqrt{\hat{p}(1 - \hat{p})/n}$
where,
CI = confidence interval
p̂ = sample proportion
Thus, for a proportion of 0.5, a confidence level of 95%, and a sample size of 1,280, the result
is:
$CI = 0.5 \pm 1.96 \times \sqrt{0.5 \times 0.5 / 1{,}280}$
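Evaluating this interval numerically is straightforward. The sketch below uses a hypothetical helper `proportion_ci` with the values from the text (proportion 0.5, z = 1.96, n = 1,280).

```python
from math import sqrt

def proportion_ci(p_hat, n, z):
    """Confidence interval for a proportion: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(p_hat=0.5, n=1280, z=1.96)

print(round(low, 4), round(high, 4))  # 0.4726 0.5274
```

The margin of error is therefore about 2.7 percentage points on either side of the sample proportion.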
7.6 LET US SUM UP
This chapter includes important formulas and concepts of a sampling distribution and
determining estimates within samples as well as sample sizes. Starting with the sampling
distribution and the central limit theorem, the chapter also describes the distribution of a
sample’s mean and sample proportions along with suitable examples. In this chapter, we have
discussed how to determine sample sizes in a real business situation.
c. 30
d. 100
Q.5 Suppose a sample of size 36 is drawn from a population with a mean of 60 and a
standard deviation of 12. What is the expected value of the sample means?
a. 0.6
b. 60
c. 36
d. 12
STRUCTURE
8.0 Objectives
8.1 Introduction
8.2 Fundamentals of Hypothesis-Testing Methodology
8.3 t-Test of Hypothesis for the Mean (Sigma Unknown)
8.4 One-Tail Tests
8.5 Z Test of Hypothesis for the Proportion
8.6 Let Us Sum Up
8.7 Self-Assessment Questions
8.8 Answers to Self-Assessment Questions
8.0 OBJECTIVES
After reading this unit, you will be able to:
• understand concepts of hypothesis testing
8.1 INTRODUCTION
This unit will introduce you to the basic concepts of hypothesis like the null hypothesis,
alternate hypothesis, level of significance, type of test, etc. At the end of the unit, you will be
able to estimate the population mean when population standard deviation is unknown. Also,
the unit will provide an understanding of one tail test in a different scenario. The chapter will
include some exercises, which will allow you to observe different situations to conduct
testing of the hypothesis.
There are two ways to draw inferences about a population using sample data. The first is to
calculate a point estimate of a population parameter and then form a confidence interval
around this point estimate. Secondly, a researcher often has a particular theory, or
hypothesis, that he or she would like to test. The hypothesis might be that a new machine
is working properly or requires replacement, that product sales in a new packaging design
will be more or less than with the current design, and so on. In the latter case, the researcher
collects relevant sample data and checks whether the data provide sufficient evidence to
support the hypothesis.
The hypothesis that the researcher is attempting to prove is called the alternative hypothesis
or research hypothesis. The opposite hypothesis, that of maintaining the status quo, is called
the null hypothesis.
Let’s take the example of a vending machine that supplies coffee (i.e., 100 ml per cup). Suppose
there is a complaint that the machine is supplying less than 100 ml. The manufacturer wants
to test this hypothesis. In this case, the null and alternative hypotheses could be formally
stated as:
H0: μ ≥ 100
H1: μ < 100
H0 is called the null hypothesis. In this example, the null hypothesis states that the vending
machine is supplying 100 ml or more. H1, sometimes written as Ha, is called the alternative
hypothesis: in this case, the alternative hypothesis is that the vending machine is supplying
less than 100 ml. Note that the null and alternative hypotheses must be both mutually
exclusive (no results could satisfy both conditions) and exhaustive (all possible results will
satisfy one of the two conditions). In this example, the alternative hypothesis is single-tailed:
we state that the vending machine is supplying less than 100 ml. Whether a test is single-tailed
or two-tailed depends on the alternative hypothesis.
We could also state a two-tailed alternative hypothesis if that is more appropriate to our
research objective. If the manufacturer is interested in whether the vending machine is
working properly, then the testing would consider both possibilities, higher or lower, and we
would state this using a two-tailed alternative hypothesis:
H0: μ = 100
H1: μ ≠ 100
UNIT 8
Fundamentals of Hypothesis
Testing: One-Sample Tests
Normally the first two steps would be performed before the experiment is designed or
the data collected; in such a situation, the statistic to be used for hypothesis testing is also
specified at this time or is implicit in the hypothesis and type of data involved. We then
collect the data and perform the statistical calculations, in this case probably a t-test, and
based on our results make one of two decisions:
The first case is sometimes called “finding significance” or “finding significant results.” The
process of statistical testing involves establishing a probability level, the level of significance
against which the p-value is compared (a topic treated at greater length below), beyond which
we consider results from our sample strong enough to support the rejection of the null
hypothesis.
In practice, this cutoff is commonly set at 0.05. Why this particular value? It’s an arbitrary
cutoff point and dates back to the early twentieth century, when statistics were computed by
hand and the results compared to published tables to determine whether a result was
significant or not.
Alternative lower values are sometimes used, such as p < 0.01 or p < 0.001, but no one has
been successful in legitimizing the use of a higher cutoff, such as p < 0.10. Note that failure to
reject the null hypothesis does not mean that we have proven it to be true, only that the
experiment or study did not provide sufficient evidence to reject it.
Inferential statistics allows us to make probabilistic statements about the data, but the
possibility of error is inherent in the process. Statisticians have classified two types of errors
when making decisions in inferential statistics and set levels for error rates that are
commonly considered acceptable.
The two types of error are known as Type I and Type II errors. In our professional and personal
lives, we often have to make accept-reject types of decisions based on incomplete data. As
long as such decisions are made based on evidence that does not provide 100% confidence,
there will be possibilities of errors. No error is committed when a good prospect is accepted
or a bad one is rejected. But there is always a possibility that a bad prospect is accepted or a
good one is rejected. Of course, we would like to minimize the possibilities of such errors.
During statistical hypothesis testing, rejecting a true null hypothesis is known as a Type I
error, and accepting a false null hypothesis is known as a Type II error.
 | H0 True | H0 False
Accept H0 | No Error | Type II Error
Reject H0 | Type I Error | No Error
In hypothesis testing, the major task is to minimize the chances of Type I and Type II errors.
Unfortunately, it is not possible to minimize both errors. Thus by fixing one of them, the
If we select a simple random sample (SRS) of size n from any normally distributed population
with mean μ and standard deviation σ, the sample mean x̄ follows a normal distribution with
mean μ and standard deviation σ/√n. When σ is not known, the sample standard
deviation s will be used to estimate the standard deviation of x̄ by s/√n.
To illustrate the use of the t-test for the mean, let us assume that we have the sales performance
of a particular location for the last ten weeks as given in Table 8.2. The business objective is
to determine whether the average sales performance has been $1,000 over the past ten
weeks. As a manager of the company, you need to determine whether this amount has
changed. In other words, the hypothesis test is used to determine whether the average
sales differ from $1,000, in either direction.
Table 8.2: Sales Performance of a Particular Location
Week | Sales Performance ($)
1 | 990
2 | 1,050
3 | 950
4 | 975
5 | 1,025
6 | 1,075
7 | 975
8 | 999
9 | 1,000
10 | 1,100
Solution:
To perform this two-tail hypothesis test, follow the steps given below:
Step 2 Data is collected as a sample of n = 10 weekly sales performance figures from the last
ten weeks. Suppose we decide to use α = 0.05 (level of significance).
Step 3 Because σ is unknown, we will use the t distribution and the t-STAT test statistic. To
use the t distribution, we assume that the population of sales performance is approximately
normally distributed, because the sample size is only 10.
Step 4 For a given sample size, n, the test statistic t-STAT follows a t distribution with n − 1
degrees of freedom. The critical values of the t distribution with 10 − 1 = 9 degrees of freedom
can be found from the statistical table. But since we are using Excel for testing the
hypothesis, only the p-value needs to be checked for acceptance or rejection of the null
hypothesis. α is already established at 0.05 (level of significance). Thus, H0 is rejected when
the p-value is below α; that is, whenever the p-value is less than 5%, H0 will be
rejected.
As a t-test for a single sample is not directly available in Excel, the t-Test for two samples
assuming unequal variances is used. The hypothesized average sales performance is used as a
second sample, with the same value repeated in C2:C11.
Fig. 8.1: t-Test for two-sample assuming unequal variances
[Figure: Excel worksheet with the Sales Performance data and the t-Test: Two-Sample Assuming Unequal Variances dialog (input ranges, hypothesized mean difference, labels, alpha, output options)]
Fig. 8.2: Excel worksheet results for the Average Sales Performance example t-test
t-Test: Two-Sample Assuming Unequal Variances
 | Sales Performance | Average Sales Performance
Mean | 1013.9 | 1000
Variance | 2296.544444 | 0
Observations | 10 | 10
Hypothesized Mean Difference | 0 |
df | 9 |
t Stat | 0.917228146 |
P(T<=t) one-tail | 0.191472125 |
t Critical one-tail | 1.833112933 |
P(T<=t) two-tail | 0.382944249 |
t Critical two-tail | 2.262157163 |
Interpretation:
From the Fig. 8.2 results, t-STAT = 0.917 and the p-value = 0.3829. Because the p-value of
0.3829 is greater than α = 0.05, we do not reject H0. The data provide insufficient evidence
to conclude that the average sales performance differs from $1,000.
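The t statistic in the Excel output can be reproduced from the ten weekly figures with the one-sample formula t = (x̄ − μ)/(s/√n). The sketch below uses only Python's standard library, so it compares the statistic with the two-tail critical value reported in Fig. 8.2 instead of computing a p-value; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Weekly sales performance from Table 8.2.
sales = [990, 1050, 950, 975, 1025, 1075, 975, 999, 1000, 1100]
hypothesized_mean = 1000

n = len(sales)
x_bar = mean(sales)      # sample mean
s = stdev(sales)         # sample standard deviation (n - 1 in the denominator)

t_stat = (x_bar - hypothesized_mean) / (s / sqrt(n))
df = n - 1

# Two-tail critical value for alpha = 0.05 and 9 df, taken from Fig. 8.2.
t_critical_two_tail = 2.262157163
reject_h0 = abs(t_stat) > t_critical_two_tail

print(x_bar, df)          # 1013.9 9
print(round(t_stat, 4))   # 0.9172
print(reject_h0)          # False
```

The critical-value comparison leads to the same decision as the p-value approach: H0 is not rejected.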
8.4 ONE-TAIL TESTS
In the last section, we saw an example of hypothesis testing with a two-tail test, so called
because the rejection region is divided into the two tails of the sampling distribution of the
mean. In the same example discussed above, suppose our focus is on a particular direction.
Suppose that the manager is worried about the sales performance of employees and wants
to arrange some new training for employees only if the sample of the last ten weeks shows
decreased sales performance.
To perform this one-tail hypothesis test, follow the steps given below:
Now, the objective is to determine whether the average sales performance of employees is
less than $1,000.
The alternative hypothesis contains the statement for which we are trying to find evidence. If
the conclusion of the test is “reject H0,” there is statistical evidence that the average sales
performance is less than $1,000. This would be a sufficient reason to arrange a new training
program for employees for better performance. If the conclusion of the test is “do not reject
H0,” then there is insufficient evidence that the average sales performance is less than
$1,000. If this occurs, there would be no reason to arrange training.
Step 2 The same data set is used, which is a sample of ten weeks (n = 10). We decide to use α = 0.05.
Step 3 Because σ is unknown, we will use the t distribution and the t-STAT test statistic. We
are assuming that the sales performance data is normally distributed.
Step 4 Now, in this case, the rejection region is entirely contained in the lower tail of the
sampling distribution of the mean. We want to reject H0 only when the sample mean is
significantly less than $1,000. Because the entire rejection region is contained in one tail of
the sampling distribution of the test statistic, the test is called a one-tail test, or directional
test. If the alternative hypothesis includes the less-than sign, the critical value of t is
negative. Finally, a decision is taken based on the p-value. α is already established at 0.05
(level of significance). Thus, H0 is rejected when the p-value is below α; that is,
whenever the p-value is less than 5%, H0 will be rejected.
The one-tailed test is explained using the same example; thus, the same Excel output will be
interpreted, considering the one-tail values.
Fig. 8.3: Excel worksheet results for the Average Sales Performance example t-test
(one-tailed test)
t-Test: Two-Sample Assuming Unequal Variances
 | Sales Performance | Average Sales Performance
Mean | 1013.9 | 1000
Variance | 2296.544444 | 0
Observations | 10 | 10
Hypothesized Mean Difference | 0 |
df | 9 |
t Stat | 0.917228146 |
P(T<=t) one-tail | 0.191472125 |
t Critical one-tail | 1.833112933 |
P(T<=t) two-tail | 0.382944249 |
t Critical two-tail | 2.262157163 |
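Interpretation:
From the Fig. 8.3 results, t-STAT = 0.917 and the one-tail p-value reported by Excel is 0.1914.
Because the t statistic is positive (the sample mean of 1013.9 is above $1,000), it lies in the
tail opposite to the lower-tail rejection region, so the lower-tail p-value is 1 − 0.1915 = 0.8085.
Either value is greater than α = 0.05, so we do not reject H0. The data provide insufficient
evidence to conclude that the average sales performance is less than $1,000.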
$Z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$
Then the test statistic for the test of proportion can be written as:
$Z_{STAT} = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$
H0: p = 0.80 (i.e., the proportion of satisfied customers has not changed from the previous
year)
H1: p ≠ 0.80 (i.e., the proportion of satisfied customers has changed from the previous
year)
Since we are interested in determining whether the population proportion of satisfied
customers has changed from 0.80 in the previous year, we will use a two-tail test.
Suppose α = 0.05 is the level of significance.
Thus, the decision about rejection or acceptance will be taken as follows:
Fig. 8.3 shows the Excel Z test worksheet results for whether the proportion of satisfied
customers has changed from 0.80.
Fig. 8.3: Excel Z test Work Sheet
Z Test of Hypothesis for the Proportion
Available information
Null Hypothesis p = | 0.8
Level of Significance = | 0.05
Number of Satisfied Customers = | 390
Sample Size = | 500
Standard Error
Z Test Statistic
The NORM.S.INV functions in cells E12 and E13 determine the upper and lower critical values
shown in Fig. 8.3. The NORM.S.DIST function was used to calculate the p-value in cell E14.
As an alternative to the critical value approach, we can also use the p-value to decide
on rejection of the null hypothesis. For this two-tail test, in which the rejection
region is located in the lower tail and the upper tail, we find the area below a Z value
of −1.1180 and above a Z value of +1.1180. Fig. 8.3 reports a p-value of 0.2635. Because this
value is greater than the selected level of significance α = 0.05, we do not reject the
null hypothesis. Hence, there is no significant change in the proportion of satisfied customers
from the previous year’s survey.
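The Z test for the proportion can be sketched in Python with the figures from the worksheet (390 satisfied customers out of 500, hypothesized proportion 0.80). Here `NormalDist().inv_cdf` and `NormalDist().cdf` are used as stand-ins for Excel's NORM.S.INV and NORM.S.DIST; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.80        # hypothesized population proportion
x, n = 390, 500  # satisfied customers and sample size
alpha = 0.05     # level of significance

p_hat = x / n                         # sample proportion, 0.78
std_error = sqrt(p0 * (1 - p0) / n)   # standard error under H0
z_stat = (p_hat - p0) / std_error

standard_normal = NormalDist()
upper_critical = standard_normal.inv_cdf(1 - alpha / 2)   # about 1.96
lower_critical = -upper_critical

# Two-tail p-value: area below -|z| plus area above +|z|.
p_value = 2 * (1 - standard_normal.cdf(abs(z_stat)))
reject_h0 = p_value < alpha

print(round(z_stat, 4))           # -1.118
print(round(upper_critical, 2))   # 1.96
print(reject_h0)                  # False
```

The p-value computed this way is close to the 0.2635 reported in the worksheet, and both the critical-value and p-value approaches lead to the same decision.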
8.6 LET US SUM UP
parameter.
c. A statistical hypothesis is any random statement made by a researcher.
a. Level of significance
b. Sample size
c. Type of error
d. A null and alternate hypothesis
Q.3 In hypothesis testing, what is the assumption about the null hypothesis?
Q.5 In the following example, the null and alternative hypotheses are written as:
H0: μ ≤ 10
H1: μ > 10
Select a statement that holds.
a. The test is a two-tail test
b. The test is a left-tail test
c. The test is a one-tail test
Q.3 b
Q.4 b
Q.5 c
STRUCTURE
9.0 Objectives
9.1 Introduction
9.2 Comparing the Mean of Two Independent Populations
9.3 Comparing the Mean of Two Related Populations
9.4 Comparing the Proportions of Two Independent Populations
9.5 Let Us Sum Up
9.6 Self-Assessment Questions
9.7 Answers to Self-Assessment Questions
9.0 OBJECTIVES
9.1 INTRODUCTION
This chapter will introduce the student to analyzing data from two samples. The text is
prepared with the notion that the student should be able to test hypotheses in the case of
two populations. Simple data sets are used to demonstrate the solution. The focus will also
be on the use of a z statistic or a t statistic for analyzing the differences in two sample means.
It will also discuss how to make a choice when the population standard deviation is known or
unknown. The concept of pooled variance is also discussed in the chapter. The Z test for a
proportion and the t-test for independent and related samples are discussed with the help of
suitable examples.
UNIT 9
Two-Sample Test
Two-sample problems are among the most common situations encountered in statistical
practice. Sometimes researchers wish to compare two individuals, groups or companies in
terms of their performance. In such studies, we need to compare two random samples, one
from each of two different populations.
In two-sample problems, the purpose is to compare the responses in two groups, where
each group is considered to be a sample from a distinct population and the responses are
independent.
Once we have two independent samples from two distinct populations, we measure the same variable in both samples. We call the variable $x_1$ in the first population and $x_2$ in the second because the variable may have different distributions in the two populations. Suppose $\bar{x}_1$ is the mean of an SRS of size $n_1$ drawn from an $N(\mu_1, \sigma_1)$ population and $\bar{x}_2$ is the mean of an independent SRS of size $n_2$ drawn from an $N(\mu_2, \sigma_2)$ population. Then the two-sample z statistic can be written as:
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
It has the standard Normal N(0, 1) sampling distribution. If the two population standard deviations $\sigma_1$ and $\sigma_2$ are not known, we estimate them by the sample standard deviations $s_1$ and $s_2$ from our two samples.
Suppose that the two Normal population distributions have the same standard deviation $\sigma$. Both sample variances $s_1^2$ and $s_2^2$ then estimate $\sigma^2$. The combined (pooled) estimate is a weighted average of the two, with weights equal to their degrees of freedom. The resulting estimator of $\sigma^2$ is:
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
When the two standard deviations are not assumed to be equal, to test the hypothesis $H_0: \mu_1 = \mu_2$ (at, say, the 0.05 level of significance) we use P-values or critical values for the t(k) distribution, where the degrees of freedom k are either approximated by software or taken as the smaller of $n_1 - 1$ and $n_2 - 1$.
Example:
Assume that recently released movies were randomly selected and rated by viewers from two different age groups. If the ratings are normally distributed, conduct a statistical test to determine whether there is a significant difference between the average ratings given by the two groups.
Table 9.1: Information Table
Movie   Group 1   Group 2
1       90        35
2       81        33
3       61        10
4       54        50
5       60        48
6       73        73
7       44        33
8       30        11
9       25        58
10      38        18
11      52        12
12      32        61
13      16        96
14      8         94
15      18        80
Solution:
Using the Data Analysis ToolPak in Excel (its installation is explained in the previous unit), the t-test for two samples assuming unequal variances is applied to the data, entered as shown in Fig. 9.1.
Fig. 9.2: Excel output of t-Test for Two-Sample Assuming Unequal Variances
                                Group 1        Group 2
Mean                            45.46666667    47.46667
Variance                        606.8380952    854.6952
Observations                    15             15
Hypothesized Mean Difference    0
df                              27
t Stat                          -0.202614346
P(T<=t) one-tail                0.420477582
t Critical one-tail             1.703288446
P(T<=t) two-tail                0.840955165
t Critical two-tail             2.051830516
Interpretation:
From the Fig. 9.2 results, t-STAT = -0.2026 and the two-tail p-value = 0.8409. Because the p-value of 0.8409 is greater than $\alpha$ = 0.05, we do not reject H0. The data provide insufficient evidence to conclude that there is a significant difference between the average ratings given by the two groups.
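The Excel result above can be cross-checked with a minimal Python sketch that uses only the standard library. The function name welch_t is an assumption made for illustration; the critical value is copied from the Excel output rather than computed, because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

# Ratings from Table 9.1
group1 = [90, 81, 61, 54, 60, 73, 44, 30, 25, 38, 52, 32, 16, 8, 18]
group2 = [35, 33, 10, 50, 48, 73, 33, 11, 58, 18, 12, 61, 96, 94, 80]

def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for unequal variances."""
    va = variance(a) / len(a)  # s1^2 / n1 (sample variance, n-1 divisor)
    vb = variance(b) / len(b)  # s2^2 / n2
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(group1, group2)
T_CRITICAL = 2.0518  # two-tail critical value for df = 27, alpha = 0.05 (from the Excel output)
reject_h0 = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.4f}, df = {df:.1f}, reject H0: {reject_h0}")
```

Excel rounds the approximate degrees of freedom to a whole number (27) before looking up the critical value, which is why the sketch compares against the tabulated value for 27 degrees of freedom.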
Assuming that the difference scores are randomly and independently selected from a population that is normally distributed, we can use the paired t-test for the mean difference in related populations to determine whether there is a significant population mean difference. To test the null hypothesis that there is no difference in the means of two related populations:
$$H_0: \mu_D = 0$$
$$H_1: \mu_D \neq 0$$
Suppose the level of significance is $\alpha = 0.05$. The test statistic for the paired t-test can be written as
$$t_{STAT} = \frac{\bar{D} - \mu_D}{S_D / \sqrt{n}}$$
where $\mu_D$ is the hypothesized mean difference, $\bar{D}$ is the average of the differences, $S_D$ is the standard deviation of the differences, and the test statistic follows a t distribution with n-1 degrees of freedom.
Example:
ABC Network, Inc., wants to investigate the impact of the marketing campaign it conducted at the beginning of this year. Before expanding the business, the network managers want to test whether this marketing campaign increased sales on average. A random sample of the sales of 10 products before and after the marketing campaign is collected. The paired observations for each product are given in Table 9.2. Based on these data, the network managers want to test the null hypothesis that there is no significant change in sales before and after the marketing campaign against the alternative hypothesis that there is. The following solution to this problem introduces a paired-observation t-test.
Table 9.2: Total Sales of 10 Products Before and After Marketing Campaign
Product   Before   After
2         540      520
3         106      95
4         200      212
5         900      800
6         265      300
7         50       110
8         206      129
9         489      540
10        590      610
Fig. 9.2.2: Excel output of t-Test: Paired Two Sample for Means
Interpretation:
Variances are squares of standard deviations.
From the Fig. 9.2.2 results, t-STAT = 0.2517 and the p-value = 0.8068. Because the p-value of 0.8068 is greater than $\alpha$ = 0.05, we do not reject H0 (i.e., $H_0: \mu_D = 0$ against $H_1: \mu_D \neq 0$). The data provide insufficient evidence against the null hypothesis. In other words, there is no significant difference between the sales of the 10 products before and after the marketing campaign.
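The paired t statistic can be sketched in a few lines of Python using only the standard library. This sketch uses the nine product rows (2 to 10) that are legible in Table 9.2; because the row for product 1 is not available here, the result is illustrative only and is not expected to reproduce the t-STAT of 0.2517 reported above. The column order (before, after) and the function name paired_t are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

# (before, after) sales for products 2-10 from Table 9.2; product 1 is not available
sales = [(540, 520), (106, 95), (200, 212), (900, 800), (265, 300),
         (50, 110), (206, 129), (489, 540), (590, 610)]

def paired_t(pairs, mu_d=0.0):
    """Return (t statistic, degrees of freedom) for the paired t-test."""
    diffs = [before - after for before, after in pairs]
    n = len(diffs)
    # t = (mean difference - hypothesized difference) / (S_D / sqrt(n))
    t = (mean(diffs) - mu_d) / (stdev(diffs) / sqrt(n))
    return t, n - 1

t_stat, df = paired_t(sales)
print(f"t = {t_stat:.4f} with {df} degrees of freedom")
```

As in the worked example, a small t statistic relative to the critical value of the t distribution with n-1 degrees of freedom gives no reason to reject the null hypothesis of zero mean difference.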
The test statistic for the difference between two population proportions, where the null hypothesis difference is zero, is given as
$$z = \frac{\hat{p}_1 - \hat{p}_2 - 0}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where $\hat{p}_1 = x_1/n_1$ is the sample proportion in sample 1 and $\hat{p}_2 = x_2/n_2$ is the sample proportion in sample 2. The symbol $\hat{p}$ stands for the combined sample proportion in both samples, considered as a single sample. That is:
$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$
When the null hypothesis is that the difference between the two population proportions is a number other than zero, we cannot assume that the sample proportions $\hat{p}_1$ and $\hat{p}_2$ are estimates of the same population proportion, and it is not possible to pool the two estimates. In such a situation, the test statistic for the difference between two population proportions, when the null hypothesis difference between the two proportions is some number D other than zero, is given below
$$z = \frac{\hat{p}_1 - \hat{p}_2 - D}{\sqrt{\dfrac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}}$$
To illustrate the use of the Z test for the equality of two proportions, consider a management institute that wishes to compare two online platforms and select the more user-friendly one for conducting online evaluations of its students. The management decided to run a test exam on both platforms at different points in time. During the exam on Platform 1, a total of 200 queries were raised, of which 30 were technical queries about the platform. Similarly, when the exam was conducted on Platform 2, 175 queries were raised, of which 45 were technical queries. At the 0.05 level of significance, is there evidence of a significant difference between the two platforms in terms of user-friendliness?
Fig. 9.3: Excel Z test worksheet results for the difference between two proportions for the two platforms
In this worksheet, the NORM.S.INV function is used to compute the lower and upper critical values.
The NORM.S.DIST function is used to compute the p-value.
As an alternative to the critical value approach, we can also use the p-value to decide whether to reject the null hypothesis. For this two-tail test, in which the rejection region is located in both the lower tail and the upper tail, we find the area below a Z value of -2.5877 and above a Z value of +2.5877. Fig. 9.3 reports a p-value of 0.0096. Because this value is less than the selected level of significance $\alpha$ = 0.05, we reject the null hypothesis. Hence, there is a significant difference between the two platforms in terms of user-friendliness.
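The platform example can be reproduced with a minimal Python sketch based on the standard library's NormalDist class. The function name two_proportion_z is an assumption made for illustration; the calculation follows the pooled-proportion formula for a zero hypothesized difference.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Z test for H0: no difference between two population proportions (pooled estimate)."""
    p1, p2 = x1 / n1, x2 / n2
    p_bar = (x1 + x2) / (n1 + n2)                        # combined sample proportion
    se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))   # standard error under H0
    z = (p1 - p2) / se
    p_value = 2 * NormalDist().cdf(-abs(z))              # two-tail p-value
    return z, p_value

# Platform 1: 30 technical queries out of 200; Platform 2: 45 out of 175
z, p_value = two_proportion_z(30, 200, 45, 175)
z_critical = NormalDist().inv_cdf(1 - 0.05 / 2)          # upper critical value for alpha = 0.05
reject_h0 = abs(z) > z_critical
print(f"z = {z:.4f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

Here inv_cdf plays the role of Excel's NORM.S.INV and cdf the role of NORM.S.DIST, so the critical value approach and the p-value approach lead to the same decision.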
9.5 LET US SUM UP
In real-life situations, we often need to compare two populations. This may be done by comparing independent samples or related samples. There are also instances in which the analysis concerns a given characteristic, and a comparison of proportions comes into the picture. In this chapter, detailed procedures and applications of these techniques are provided with appropriate examples. The two populations are studied by comparing two sample means. The use of the z test and the t-test is discussed. The assumption of normality, the concepts of known and unknown population standard deviations, independent samples, related samples, and pooled variance are introduced to the student for the analysis of a given situation. Solutions are obtained manually as well as through Excel.
b, MiSfe
c. M#pe
d. HiT#e
Q.2 In hypothesis testing, to compare the means of two random samples and decide whether they come from two different populations or from the same one, the alternate hypothesis can be framed as:
a. $\mu_1 \geq \mu_2$
b. $\mu_1 \leq \mu_2$
c. $\mu_1 = \mu_2$
d. $\mu_1 \neq \mu_2$
a. -2.58 and 2.58
b. -1.96 and 1.96
c. -1.28 and 1.28
d. -2.33 and 2.33
Q.1 d
Q.2
Q.3
Q4
Q5
ANALYSIS OF VARIANCE
STRUCTURE
10.0 Objectives
10.1 Introduction
10.2 Applications of ANOVA
10.3 The Completely Randomized Design: One-Way ANOVA
10.4 The Factorial Design: Two-way ANOVA
10.5 The Randomized Block Design
10.6 Let Us Sum Up
10.7 Key Words
10.8 References and Suggested Additional Readings
10.9 Self-Assessment Questions
10.10 Check Your Progress — Possible Answers
10.11 Answers to Self-Assessment Questions
10.0 OBJECTIVES
After completion of this unit, you will be able to learn:
UNIT 10
Analysis of Variance
In the last chapter, we compared two populations using a t-test. In practice, there are many situations in which we need to compare more than two populations, and the t-test is of limited use in such situations. The technique used to compare more than two populations or subgroups of a population is called ANOVA (an acronym for Analysis of Variance). ANOVA is used to compare populations each containing several levels or subgroups. Since the ANOVA test statistic is, at its core, a ratio of variances, it uses the F distribution to test hypotheses.
ANOVA comes in many forms. One-way ANOVA (Single-factor ANOVA) has only one factor
and several groups (or levels) of the population are to be compared. A study conducted to
compare the mileage of three models of car would involve a one-way ANOVA, since it has
only one factor i.e. car model with three groups or levels (three models of the car). In the
experimental design context, one-way ANOVA is called a completely randomized design.
In a factorial design, more than one factor can be studied simultaneously in a single experiment. In a two-way ANOVA design, there is a second factor that also accounts for variation across the groups (or populations). If the three car models are further grouped into petrol and diesel variants to compare mileage, the design is called two-way ANOVA. The fuel used acts as the second factor with two levels (petrol and diesel).
Two-way ANOVA has two forms: two-factor without replication and two-factor with replication. Two-factor without replication is also called the randomized block design.
Irrespective of discipline, ANOVA can be used in any situation where a comparison is being made between more than two groups or populations. It provides a robust procedure for hypothesis testing in such situations. An illustrative list of its applications in the field of business and management is given below:
• Examining the difference in the rate of return across various types of investments.
• Determining whether there is genuine variation in the demand for a product across various price points.
• Finding out whether there is a difference among various Chinese snacks based on the cooking time.
In One-way ANOVA there is only one factor that is used to categorize the data in various
groups (populations or samples). This single factor has many levels or groups. These levels
can be categorical or numerical. Let’s illustrate this with an example.
Example 10.1: A car rating agency wants to compare the mileage of three car models, belonging to a particular car segment, manufactured by an automobile company. It collects mileage data from a sample of 6 cars in each model category to check whether the mileage differs across car models. The following mileage data (in km) were collected for 18 cars of the same age, driven under the same conditions, to control for variation due to these factors:
Table 10.1
Model A   Model B   Model C
15        8         12
18        7         19
17        10        18
19        15        12
19        14        17
20        14        14
The comparison of the mileage of three car models requires one-way ANOVA, as there is a single factor (car model) with three levels (Model A, Model B, and Model C). These levels can be categorical or numerical. In our example, the three levels (Model A, Model B, and Model C) are categorical. If we categorized the cars on a factor like price, with levels such as 5 Lakhs, 7 Lakhs, 10 Lakhs, and 12 Lakhs, the levels would be called numerical.
One-way ANOVA is also known as single-factor ANOVA. In the experimental design context, it is called a completely randomized design because units are randomly selected and assigned to groups. In Example 10.1, the cars of each model have been selected randomly, making it a completely randomized design.
The analysis of variance (ANOVA) procedure requires comparing the means of the groups. If the means of the groups are not different from each other, the groups can be considered part of the same larger population, and the null hypothesis, 'all population (group) means are equal', is not rejected. But if the means are significantly apart, the groups are considered to belong to different populations, and the null hypothesis is rejected in favour of the alternate hypothesis, 'all population (group) means are not equal'.
In case there are three populations (or groups), the ANOVA procedure requires comparing the means of three samples to see if a meaningful difference exists among them. In other words, we measure the relative distance between the three sample means. To measure this relative distance, each sample mean is compared with the overall mean (the mean of the population in the background). The following three situations may arise:
(i) If all sample means are equal, the samples come from the same population. We conclude that there is no difference between the three populations. Three different populations do not exist in such a situation, and the null hypothesis ($H_0: \mu_1 = \mu_2 = \mu_3$) is not rejected.
(ii) If one of the means is far away from the other two, it is likely that it is not from the same population. In such a case, two of the three samples belong to a common population but the third sample belongs to a different population. Since the mean value of all samples is not the same, we reject the null hypothesis ($H_0: \mu_1 = \mu_2 = \mu_3$) and accept the alternate hypothesis (all population means are not equal).
(iii) If all three means are far apart, they all come from distinct populations. In this case, too, the alternate hypothesis (that all population means are not equal) is accepted.
To test our null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$, we are not checking whether the sample means are exactly equal. We are checking whether each sample comes from the same larger overall population or not.
(a) Ratio of means
(b) Ratio of variances
Each set of data (group) is a distribution in itself, with its own mean and variance (since not all values in the data set equal the mean of the data set). The idea of variability is of particular importance in ANOVA. In Example 10.1, we were checking whether there was a significant variation in the mileage of cars. Some proportion of the variation in mileage will be due to differences in models. Models may differ in gross weight, engine capacity, number of cylinders, etc. This proportion of variance is called explained variation or direct variation because we want to compare the mileage based on car models. This variation has been accounted for. The other proportion of the variation in mileage will be due to reasons other than the difference in car models. Such factors are not accounted for in our study. This proportion of variance is called unexplained variation or error variation. Though we try to control some factors like age and driving conditions, there may still be factors such as driving habits, driving hours, and others that we have not considered but that can cause error variation. Thus, the total variation is composed of two parts: between-columns variation and within-groups variation.
The two variations (between columns and within groups) add up to the total variation. Total variation is the variance of all data values from the overall mean.
Between-columns variation is the variance of the sample means from the overall mean. It arises from the differences between columns or groups.
The F ratio compares the two components:
$$F = \frac{\text{Between-columns variation}}{\text{Within-groups variation}}$$
If the between-columns variation is larger than the within-groups variation, the ratio will be larger than 1. In such a situation the samples, in all likelihood, do not originate from a common population, and the null hypothesis, 'all means are equal', is rejected. Otherwise, the null hypothesis is not rejected.
10.3.4 SUM OF SQUARES
Variance is the average of the squared deviations from the mean. The expressions for variance are:
$$\text{Sample variance: } S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
$$\text{Population variance: } \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
In the above expressions, the numerator, for example $\sum_{i=1}^{n} (x_i - \bar{x})^2$, is the sum of squares (SS). If we divide SS by N or n-1, we get the variance of a data set. Therefore, SS is the variance before averaging, i.e. the sum of squared deviations that has not yet been divided by the number of observations or the degrees of freedom.
The concept of the sum of squares (SS) is fundamental to variation and hence ANOVA. Each
type of variation is represented by its corresponding sum of squares (SS). Total variation is
represented by the sum of squares total (SST), between-columns variations are represented
by the sum of squares column (SSC), and within-group variation is represented by the sum of
squares within (SSW). Each SS in ANOVA has its degrees of freedom.
The sum of squares total (SST) is the sum of squared deviations between each observation in the data and the grand mean (the mean of all values in the data). For N values in the data, there are N-1 degrees of freedom associated with SST.
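$$SST = \sum_{i=1}^{C} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_G)^2$$
where $X_{ij}$ is an individual value, $\bar{X}_G$ is the grand mean, and the sum runs over all N values in the data.
$$SSC = \sum_{i=1}^{C} n_i (\bar{X}_i - \bar{X}_G)^2$$
where $\bar{X}_i$ is the mean of a distribution (group), $\bar{X}_G$ is the grand mean, $n_i$ is the number of observations in the group, and C is the number of groups.
$$SSW = \sum_{i=1}^{C} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2$$
where $X_{ij}$ is an individual value in the distribution and $\bar{X}_i$ is the mean of that distribution.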
The sum of squares total (SST) is partitioned into two parts, the sum of squares column (SSC) and the sum of squares within (SSW), i.e. SST = SSC + SSW.
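The partition of the total sum of squares can be verified numerically. The following is a minimal Python sketch, using only the standard library, applied to the mileage data of Table 10.1; the variable names and the column labels Model A, Model B, and Model C are taken from the surrounding text or assumed for illustration.

```python
from statistics import mean

# Mileage data from Table 10.1, one list per car model
groups = {
    "Model A": [15, 18, 17, 19, 19, 20],
    "Model B": [8, 7, 10, 15, 14, 14],
    "Model C": [12, 19, 18, 12, 17, 14],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# SST: every observation against the grand mean
sst = sum((x - grand_mean) ** 2 for x in all_values)
# SSC: each group mean against the grand mean, weighted by group size
ssc = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
# SSW: every observation against its own group mean
ssw = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

print(f"SST = {sst:.2f}, SSC = {ssc:.2f}, SSW = {ssw:.2f}")
print(f"SSC + SSW = {ssc + ssw:.2f}")
```

The two printed lines should agree on the total, which is the identity SST = SSC + SSW stated above.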
Since variance is the mean of squared deviations (sum of squares divided by the number of
observations in case of population or degrees of freedom in case of the sample), the mean
sum of squares is integral to the calculation of variances in ANOVA.
The mean sum of squares corresponding to SST is the mean sum of squares total (MST), the
mean sum of squares corresponding to SSC is the mean sum of squares column (MSC), and
the mean sum of squares corresponding to SSW is the mean sum of squares within (MSW).
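$$MST = \frac{SST}{N - 1} \quad \text{... Equation 10.1}$$
$$MSC = \frac{SSC}{C - 1} \quad \text{... Equation 10.2}$$
$$MSW = \frac{SSW}{N - C} \quad \text{... Equation 10.3}$$
To test the null hypothesis
$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_C$$
against the alternative that not all the population means are equal, the F statistic is computed as
$$F = \frac{MSC}{MSW} \quad \text{... Equation 10.4}$$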
Since F ($F_{cal}$) is the ratio of two variances, it follows the F distribution, with C-1 degrees of freedom in the numerator and N-C degrees of freedom in the denominator.
For a given level of significance, $\alpha$, we reject the null hypothesis if the calculated $F_{cal}$ value is greater than the upper-tail critical value, $F_{critical}$, from the F distribution with C-1 degrees of freedom in the numerator and N-C degrees of freedom in the denominator.
Upper 5% points of the F distribution (rows: denominator degrees of freedom ν2; columns: numerator degrees of freedom ν1)
ν2\ν1      1      2      3      4      5      6      7      8      9     10     12     15     20     24     30     40     60    120      ∞
1      161.4  199.5  215.7  224.6  230.2  234.0  236.8  238.9  240.5  241.9  243.9  245.9  248.0  249.1  250.1  251.1  252.2  253.3  254.3
2      18.51  19.00  19.16  19.25  19.30  19.33  19.35  19.37  19.38  19.40  19.41  19.43  19.45  19.45  19.46  19.47  19.48  19.49  19.50
3      10.13   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.81   8.79   8.74   8.70   8.66   8.64   8.62   8.59   8.57   8.55   8.53
4       7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   6.00   5.96   5.91   5.86   5.80   5.77   5.75   5.72   5.69   5.66   5.63
5       6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.77   4.74   4.68   4.62   4.56   4.53   4.50   4.46   4.43   4.40   4.36
6       5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10   4.06   4.00   3.94   3.87   3.84   3.81   3.77   3.74   3.70   3.67
7       5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68   3.64   3.57   3.51   3.44   3.41   3.38   3.34   3.30   3.27   3.23
8       5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39   3.35   3.28   3.22   3.15   3.12   3.08   3.04   3.01   2.97   2.93
9       5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18   3.14   3.07   3.01   2.94   2.90   2.86   2.83   2.79   2.75   2.71
10      4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.98   2.91   2.85   2.77   2.74   2.70   2.66   2.62   2.58   2.54
11      4.84   3.98   3.59   3.36   3.20   3.09   3.01   2.95   2.90   2.85   2.79   2.72   2.65   2.61   2.57   2.53   2.49   2.45   2.40
12      4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.80   2.75   2.69   2.62   2.54   2.51   2.47   2.43   2.38   2.34   2.30
13      4.67   3.81   3.41   3.18   3.03   2.92   2.83   2.77   2.71   2.67   2.60   2.53   2.46   2.42   2.38   2.34   2.30   2.25   2.21
14      4.60   3.74   3.34   3.11   2.96   2.85   2.76   2.70   2.65   2.60   2.53   2.46   2.39   2.35   2.31   2.27   2.22   2.18   2.13
15      4.54   3.68   3.29   3.06   2.90   2.79   2.71   2.64   2.59   2.54   2.48   2.40   2.33   2.29   2.25   2.20   2.16   2.11   2.07
16      4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59   2.54   2.49   2.42   2.35   2.28   2.24   2.19   2.15   2.11   2.06   2.01
17      4.45   3.59   3.20   2.96   2.81   2.70   2.61   2.55   2.49   2.45   2.38   2.31   2.23   2.19   2.15   2.10   2.06   2.01   1.96
18      4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51   2.46   2.41   2.34   2.27   2.19   2.15   2.11   2.06   2.02   1.97   1.92
19      4.38   3.52   3.13   2.90   2.74   2.63   2.54   2.48   2.42   2.38   2.31   2.23   2.16   2.11   2.07   2.03   1.98   1.93   1.88
20      4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.39   2.35   2.28   2.20   2.12   2.08   2.04   1.99   1.95   1.90   1.84
21      4.32   3.47   3.07   2.84   2.68   2.57   2.49   2.42   2.37   2.32   2.25   2.18   2.10   2.05   2.01   1.96   1.92   1.87   1.81
22      4.30   3.44   3.05   2.82   2.66   2.55   2.46   2.40   2.34   2.30   2.23   2.15   2.07   2.03   1.98   1.94   1.89   1.84   1.78
23      4.28   3.42   3.03   2.80   2.64   2.53   2.44   2.37   2.32   2.27   2.20   2.13   2.05   2.01   1.96   1.91   1.86   1.81   1.76
24      4.26   3.40   3.01   2.78   2.62   2.51   2.42   2.36   2.30   2.25   2.18   2.11   2.03   1.98   1.94   1.89   1.84   1.79   1.73
25      4.24   3.39   2.99   2.76   2.60   2.49   2.40   2.34   2.28   2.24   2.16   2.09   2.01   1.96   1.92   1.87   1.82   1.77   1.71
26      4.23   3.37   2.98   2.74   2.59   2.47   2.39   2.32   2.27   2.22   2.15   2.07   1.99   1.95   1.90   1.85   1.80   1.75   1.69
27      4.21   3.35   2.96   2.73   2.57   2.46   2.37   2.31   2.25   2.20   2.13   2.06   1.97   1.93   1.88   1.84   1.79   1.73   1.67
28      4.20   3.34   2.95   2.71   2.56   2.45   2.36   2.29   2.24   2.19   2.12   2.04   1.96   1.91   1.87   1.82   1.77   1.71   1.65
29      4.18   3.33   2.93   2.70   2.55   2.43   2.35   2.28   2.22   2.18   2.10   2.03   1.94   1.90   1.85   1.81   1.75   1.70   1.64
30      4.17   3.32   2.92   2.69   2.53   2.42   2.33   2.27   2.21   2.16   2.09   2.01   1.93   1.89   1.84   1.79   1.74   1.68   1.62
40      4.08   3.23   2.84   2.61   2.45   2.34   2.25   2.18   2.12   2.08   2.00   1.92   1.84   1.79   1.74   1.69   1.64   1.58   1.51
60      4.00   3.15   2.76   2.53   2.37   2.25   2.17   2.10   2.04   1.99   1.92   1.84   1.75   1.70   1.65   1.59   1.53   1.47   1.39
120     3.92   3.07   2.68   2.45   2.29   2.17   2.09   2.02   1.96   1.91   1.83   1.75   1.66   1.61   1.55   1.50   1.43   1.35   1.25
∞       3.84   3.00   2.60   2.37   2.21   2.10   2.01   1.94   1.88   1.83   1.75   1.67   1.57   1.52   1.46   1.39   1.32   1.22   1.00
$F = s_1^2 / s_2^2 = \dfrac{S_1/\nu_1}{S_2/\nu_2}$, where $s_1^2 = S_1/\nu_1$ and $s_2^2 = S_2/\nu_2$ are independent mean squares estimating a common variance $\sigma^2$ and based on $\nu_1$ and $\nu_2$ degrees of freedom, respectively.
Solution 10.1:
To compare the mileage of the three car models, we use one-way ANOVA as discussed in the previous section and test the following hypotheses:
Null Hypothesis: The mean mileages of the three car models are equal ($H_0: \mu_1 = \mu_2 = \mu_3$)
Alternate Hypothesis: The mean mileages of the three car models are not all equal ($H_1$)
The procedure requires calculating the F ratio (F statistic), $F = \dfrac{MSC}{MSW}$.
If the calculated value of F is more than the critical value of F, we will reject the null
hypothesis. Otherwise, we will fail to reject the null hypothesis.
The process for calculating F is shown in the following Table 10.3 and in the working notes
that follow:
Table 10.3: Calculations of SS, MS and F
Working Notes:
(1) Calculate the mean for each sample (group): $\bar{X}_1 = 18$; $\bar{X}_2 = 11.33$; $\bar{X}_3 = 15.33$ (Columns 1, 2, 3 in the above table)
(2) Calculate the grand mean (mean of all 18 data values): $\bar{X}_G = 14.89$
(3) For each distribution, calculate the sum of the squared differences between the values in the distribution and its sample mean: $\sum (X_1 - \bar{X}_1)^2 = 16$; $\sum (X_2 - \bar{X}_2)^2 = 59.33$; $\sum (X_3 - \bar{X}_3)^2 = 47.33$ (Columns 4, 5, 6 in the above table)
(4) Add the values calculated in point 3 to get SSW = 122.67
(5) For each distribution, calculate the sum of the squared differences between the data values and the grand mean: $\sum (X_1 - \bar{X}_G)^2 = 74.07$; $\sum (X_2 - \bar{X}_G)^2 = 135.19$; $\sum (X_3 - \bar{X}_G)^2 = 48.52$ (Columns 7, 8, 9 in the above table)
(6) Add the values calculated in point 5 to get SST = 257.78
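(7) For each distribution, calculate the squared difference between the sample mean and the grand mean, weighted by the number of observations in the sample: $n_1(\bar{X}_1 - \bar{X}_G)^2 = 58.07$; $n_2(\bar{X}_2 - \bar{X}_G)^2 = 75.85$; $n_3(\bar{X}_3 - \bar{X}_G)^2 = 1.19$
(8) Add the values calculated in point 7 to get SSC = 135.11
(9) Calculate the mean sums of squares:
Mean sum of squares among groups, $MSC = \dfrac{SSC}{C - 1} = \dfrac{135.11}{3 - 1} = 67.56$, where C = number of groups.
Mean sum of squares within groups, $MSW = \dfrac{SSW}{N - C} = \dfrac{122.67}{18 - 3} = 8.18$, where N = total number of observations.
(10) Calculate the F ratio: $F = \dfrac{MSC}{MSW} = \dfrac{67.56}{8.18} = 8.26$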
HYPOTHESIS TESTING
The critical value of F with C-1 = 2 degrees of freedom in the numerator and N-C = 15 degrees of freedom in the denominator for $\alpha$ = 0.05 in the F distribution table is $F_{critical}$ = 3.68. Since the calculated $F_{cal} > F_{critical}$ (8.26 > 3.68), the null hypothesis is rejected. Hence, the mean mileages of the three car models are not all equal, and the three samples do not all come from a common population. With sample means $\bar{X}_1 = 18$; $\bar{X}_2 = 11.33$; $\bar{X}_3 = 15.33$, and grand mean (overall mean) $\bar{X}_G = 14.89$, it is observed that the first sample (Model A), with $\bar{X}_1 = 18$, stands apart as the oddball distribution and appears to come from a different population from Model B and Model C. In conclusion, the models do not all have the same mileage, and this variation in mileage is attributed to the difference in models.
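The working notes and the decision rule above can be condensed into a short Python sketch that uses only the standard library. The function name one_way_anova is an assumption made for illustration, and the critical value 3.68 is taken from the F table rather than computed.

```python
from statistics import mean

def one_way_anova(samples):
    """Return (F ratio, numerator df, denominator df) for a one-way ANOVA."""
    n_total = sum(len(s) for s in samples)
    c = len(samples)
    grand_mean = mean([x for s in samples for x in s])
    ssc = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)  # between columns
    ssw = sum((x - mean(s)) ** 2 for s in samples for x in s)         # within groups
    msc = ssc / (c - 1)
    msw = ssw / (n_total - c)
    return msc / msw, c - 1, n_total - c

# Mileage data of Example 10.1 (Table 10.1), one list per car model
mileage = [
    [15, 18, 17, 19, 19, 20],
    [8, 7, 10, 15, 14, 14],
    [12, 19, 18, 12, 17, 14],
]

f_cal, df_num, df_den = one_way_anova(mileage)
F_CRITICAL = 3.68  # upper 5% point of F with 2 and 15 degrees of freedom (from the table)
reject_h0 = f_cal > F_CRITICAL
print(f"F = {f_cal:.2f} with ({df_num}, {df_den}) df, reject H0: {reject_h0}")
```

A significant F ratio only indicates that not all means are equal; identifying which models differ requires a further comparison of the individual means.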
Source of Variation DF SS MS F
Between-columns 80
Within-groups 560
Total
In this section, we will extend the single-factor completely randomized design (one-way
ANOVA) to a two-factor factorial design, in which we have two factors rather than one. When
there are two factors to be evaluated for making a comparison between groups, the
procedure used is called two-way ANOVA. Each of the two factors should have two or more
levels to form a factorial design in the strict sense. If any factor has only one level, then the
design takes the form of one-way ANOVA itself. The factorial design may be extended to
three or more factors; in such situations, the comparison of groups requires a higher-order factorial (multi-way) ANOVA. MANOVA (multivariate analysis of variance) is a different procedure, used when there is more than one dependent variable.
Table 10.4
City 6 20 14 14
Table 10.4 provides a situation for the use of a two-way ANOVA factorial design because there are two factors used to categorize the data. The second factor, city (with six levels), is added to look for variation in the mileage of cars based on the cities where the cars are driven, together with the difference in car models. A two-way ANOVA allows us to account for variation at the row level due to this second factor, in addition to the variation across columns. The levels of the second factor are called blocks, and the second factor, placed in the rows, is called the blocking variable.
The two-way ANOVA comes in two flavors: without replication and with replication.
In the without-replication design, data units are assigned randomly on two factors, one across columns and the other across rows, without any specific sequence. It is also called a randomized block design. Table 10.4 is an example of this type of design.
In the with-replication design, the data units corresponding to each column (the same units in the sample) are repeated over the row factor across its different levels, i.e. a sample is considered for every model and replicated across the rows defined in Table 10.4. So, there are multiple measurements per row. Table 10.5 provides an example of this type of factorial design.
Table 10.5
City      Model A   Model B   Model C
City 1    18        7         19
City 1    17        10        18
City 1    16        9         16
City 2    19        15        12
City 2    19        14        17
City 2    20        14        14
City 2    18        16        15
City 3    14        7         12
City 3    19        10        17
City 3    18        12        10
City 3    14        14        12
Four cars of each model (A, B, and C) are driven in City 1 and their mileage readings are taken; the same four cars of each model are then driven in City 2 for a second set of readings, and again in City 3.
Three variations (between columns, between rows and error) add up to total variation.
Error variation is the remaining part of the variation, which is not due to differences in either columns or rows. It is the unexplained variation.
Total Variation = Between-columns variation + Between-blocks variation + Error variation
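Total variation is represented by the sum of squares total (SST), between-columns variation is represented by the sum of squares column (SSC), between-rows variation is represented by the sum of squares blocks (SSB), and error variation is represented by the sum of squares error (SSE). Each SS has its degrees of freedom.
$$SST = \sum_{i=1}^{B} \sum_{j=1}^{C} (X_{ij} - \bar{X}_G)^2 \quad \text{with N-1 degrees of freedom}$$
$$SSC = B \sum_{j=1}^{C} (\bar{X}_{\cdot j} - \bar{X}_G)^2 \quad \text{with C-1 degrees of freedom}$$
$$SSB = C \sum_{i=1}^{B} (\bar{X}_{i \cdot} - \bar{X}_G)^2 \quad \text{with B-1 degrees of freedom}$$
$$SSE = SST - SSC - SSB \quad \text{with (B-1)(C-1) degrees of freedom}$$
where $X_{ij}$ is the value in row (block) i and column j, $\bar{X}_{\cdot j}$ is the mean of column j, $\bar{X}_{i \cdot}$ is the mean of row i, $\bar{X}_G$ is the grand mean, B is the number of rows (blocks), C is the number of columns, and N = B x C.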
The mean sum of squares corresponding to SST is the mean sum of squares total (MST), the mean sum of squares corresponding to SSC is the mean sum of squares column (MSC), the mean sum of squares corresponding to SSB is the mean sum of squares blocks (MSB), and the mean sum of squares corresponding to SSE is the mean sum of squares error (MSE).
$$MSC = \frac{SSC}{C - 1} \quad \text{... Equation 10.7}$$
$$MSB = \frac{SSB}{B - 1} \quad \text{... Equation 10.8}$$
$$MSE = \frac{SSE}{(B - 1) \times (C - 1)} \quad \text{... Equation 10.9}$$
The two F ratios are $F_1 = MSC/MSE$ (for columns) and $F_2 = MSB/MSE$ (for rows or blocks).
$F_1$ follows the F distribution, with C-1 degrees of freedom in the numerator and (C-1) x (B-1) degrees of freedom in the denominator.
$F_2$ also follows the F distribution, with B-1 degrees of freedom in the numerator and (C-1) x (B-1) degrees of freedom in the denominator.
For a given level of significance, $\alpha$, we reject the null hypothesis if the calculated $F_1$ value is greater than the upper-tail critical value, $F_{critical}$, from the F distribution with C-1 degrees of freedom in the numerator and (C-1) x (B-1) degrees of freedom in the denominator. Thus, the decision rule is:
$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_C$ is rejected if $F_1 > F_{critical}$.
The variation between rows is separated from the error variation, making the difference between columns visible. The ratio $F_1 = MSC/MSE$ measures the variance between columns relative to the error variance. This lets us look at the column differences while controlling for the variance between rows.
Example 10.2: A multinational company has designed a comprehensive program for training its executives. There are four different types of training programs. To evaluate the effectiveness of these training programs, the company divided its executives into four groups and offered a different training program to each group. After the training was over, the executives were asked to take a skill-aptitude-enhancement examination. Five executives with different levels of experience were randomly selected from each training program, and their scores were recorded in the table given below.
Table 10.6
Training Programme
0-2 10 20 10 30
2-5 10 30 5 50
5-10 20 30 10 20
10-15 25 30 20 20
More than 15 25 40 20 40
The management wants to know whether the training programs are equally effective i.e.
whether they result in the same level of skill-aptitude enhancement or not. Simultaneously,
management also wants to examine whether there is some difference in the effectiveness of
programs based on the experience of employees.
What makes this problem fit for two-way ANOVA is that the executives themselves will have
their natural variation due to their different years of experience. The two-way ANOVA allows
us to account for experience variation to better determine if a difference exists among
training programs without experience variation masking any training differences. So, we can
untangle all of the sources of variation. We are interested in differences between the training
programs but since we are dealing with human beings, we have to account for natural
variation that exists among the executives themselves. We need to extract that variation
before looking at any difference that exists between training programs. And since the
sampling method involved is random without any specific sequence of selecting executives,
it is a randomized block design.
Solution 10.2:
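Experience (years) | Training Programme scores | Block Mean
0-2          | 10 | 20 | 10 | 30 | 17.5
2-5          | 10 | 30 |  5 | 50 | 23.75
5-10         | 20 | 30 | 10 | 20 | 20
10-15        | 25 | 30 | 20 | 20 | 23.75
More than 15 | 25 | 40 | 20 | 40 | 31.25
Group Means  | 18 | 30 | 13 | 32 | Grand Mean = 23.25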
Block Mean | (Block Mean − Grand Mean)²
17.5  | 33.063
23.75 | 0.250
20    | 10.563
23.75 | 0.250
31.25 | 64.000
Sum = 108.125
$F_C = \dfrac{MSC}{MSE} = \dfrac{424.58}{71.46} = 5.94$
$F_B = \dfrac{MSB}{MSE} = \dfrac{108.13}{71.46} = 1.51$
Step 8: Hypothesis Testing
Since F_C > F_crit (5.94 > 3.49), the null hypothesis H0: μ1 = μ2 = μ3 = μ4 is rejected.
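The figures in Solution 10.2 can be reproduced with a short script. This is a minimal sketch, assuming a two-way ANOVA without replication in which rows are the experience blocks and columns are the training programmes; the function name `two_way_anova` and the dictionary keys are illustrative choices, not part of the text.

```python
# Two-way ANOVA without replication (randomized block design), Example 10.2.
# Rows are experience blocks; columns are the four training programmes.
scores = [
    [10, 20, 10, 30],   # 0-2 years
    [10, 30, 5, 50],    # 2-5 years
    [20, 30, 10, 20],   # 5-10 years
    [25, 30, 20, 20],   # 10-15 years
    [25, 40, 20, 40],   # more than 15 years
]


def two_way_anova(table):
    """Return sums of squares, mean squares and F ratios (hypothetical helper)."""
    r = len(table)        # number of blocks (rows)
    c = len(table[0])     # number of groups (columns)
    n = r * c
    grand_mean = sum(sum(row) for row in table) / n
    row_means = [sum(row) / c for row in table]
    col_means = [sum(row[j] for row in table) / r for j in range(c)]

    sst = sum((x - grand_mean) ** 2 for row in table for x in row)
    ssc = r * sum((m - grand_mean) ** 2 for m in col_means)   # between columns
    ssb = c * sum((m - grand_mean) ** 2 for m in row_means)   # between blocks
    sse = sst - ssc - ssb                                     # error

    msc = ssc / (c - 1)
    msb = ssb / (r - 1)
    mse = sse / ((r - 1) * (c - 1))
    return {
        "SST": sst, "SSC": ssc, "SSB": ssb, "SSE": sse,
        "MSC": msc, "MSB": msb, "MSE": mse,
        "F_columns": msc / mse, "F_blocks": msb / mse,
    }


result = two_way_anova(scores)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The column F ratio printed by the sketch can be compared with the critical value 3.49 quoted in Step 8.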
Products
South 5 7 2 2
North 6 6 8 3
West 3.5 5.9 8 2
The sales manager wants to examine whether four products have equal sales potential.
Answer the following:
There are countless situations where the comparison is made between groups to examine
the difference (or variation) between them. Analysis of variance (ANOVA) helps to detect
such differences between groups by comparing their means. The relative distance between
each group-mean and the overall mean of the data is calculated to visualize this subtle
difference between means. The groups may have equal or different means. If relative
differences are equal, the means are not significantly different from one another, and there
is no evidence to reject the null hypothesis. It suggests that groups (or samples) come from
the same common population with no significant variation between them. If relative
differences are not equal, the means are significantly different from one another, and there
is enough evidence to reject the null hypothesis. It suggests that groups (or samples) come
from different populations with significant variation between them.
Analysis of variance is a family of designs. Designs with a single factor
are called one-way ANOVA, and designs with two or more factors are called factorial designs.
Designs with two factors are called two-way ANOVA; designs with more than two factors are
multi-factor ANOVA. (MANOVA, multivariate analysis of variance, is a different technique
used when there is more than one dependent variable.) In an experimental context, one-way
ANOVA is known as a completely randomized design. Two-factor ANOVA comes in two flavors:
without replication and with replication.
The explained source of variation in one-way ANOVA is called between-group variance
(between-column variance), and the unexplained variation (error variation) is called
within-group variance.
Since ANOVA is a ratio of variances, for more than two groups, F distribution is used to test
the hypothesis. F ratio in one-way is the ratio between the mean sum of squares column
(MSC) and the mean sum of squares within (MSW). If the F value is more than the critical F
value with C-1 degrees of freedom in the numerator and N-C degrees of freedom in the
denominator, the null hypothesis is rejected. In two-way ANOVA, the F value is the ratio between the
mean sum of squares column (MSC) and the mean sum of squares error (MSE). The ratio
of the mean sum of squares block (MSB) to mean sum of squares error (MSE) explains the
proportion of variation due to differences in blocks (rows) out of error variation. The second
variable essentially controls for variation (blocks the variation) between blocks (rows) for
identifying the variation between groups. Therefore, the row variable is also called the block
variable in the two-way ANOVA. An analysis will be required to do pair-wise comparisons to
know which group or groups are significantly better than the others.
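The F ratios described in this summary can be written compactly. In this sketch, C is the number of groups (columns), R the number of blocks (rows), and N the total number of observations; the two-way degrees of freedom shown assume the design without replication used in Example 10.2.

```latex
\begin{aligned}
\text{One-way:}\quad & F = \frac{MSC}{MSW}, && df = (C-1,\; N-C) \\
\text{Two-way (columns):}\quad & F_C = \frac{MSC}{MSE}, && df = (C-1,\; (R-1)(C-1)) \\
\text{Two-way (blocks):}\quad & F_B = \frac{MSB}{MSE}, && df = (R-1,\; (R-1)(C-1))
\end{aligned}
```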
10.7 KEYWORDS
Sum of squares blocks: Sum of squared deviations between the mean of each block
and the grand mean, weighted by the number of groups.
Sum of squares column: Sum of squared deviations between the sample mean of each
group and the grand mean, weighted by the sample size in each group.
Sum of squares total: Sum of squared deviations between each observation in the data and
the grand mean.
Sum of squares within: Sum of squared deviations between each value in a group and the
mean of that group, summed over all groups.
Total variation: Variation from all sources put together (sum of explained and unexplained
variation).
Two-factor with replication: Two-way ANOVA when data units corresponding to the column
are repeated over row levels such that there are multiple measurements per cell.
Two-factor without replication: Two-way ANOVA in which each combination of row and
column levels (each cell) has a single measurement.
Two-way ANOVA: Analysis of variance with two factors.
Levine, David M., Stephan, David F., and Szabat, Kathryn A., 2016, Statistics for Managers
Using Microsoft Excel, 7th edition, Pearson India Education Services Pvt. Ltd., Noida.
Levine, Richard L., Rubin Davis S., Rastogi, Sanjay, and Siddiqui Masood Husain, 2013,
Statistics for Management, 7th edition, Pearson Education Inc., Noida.
Shrivastava, T.N. and Rego Shailaja., 2008, Statistics for Management, Tata McGraw-Hill,
New Delhi.
Gupta, S.P. and Gupta M.P., 2010, Business Statistics, 16th edition, Sultan Chand & Sons, New
Delhi.
means.
(b) Sum of the squared deviation of all group means from the overall mean of the
data weighted by the number of observations in each group.
(c) Sum of the squared deviation of all row means from the overall mean of the data
weighted by the number of groups.
(d) Sum of the squared deviation of all data values from the overall mean of the data.
Q.3 Blocking variables serve the purpose of:
An experiment has a single factor with five groups and seven values in each group.
(a) 4
(b) 5
(c) 6
(d) 7
Q.6 How many degrees of freedom are there in determining
the within-group variation?
(a) 20
(b) 28
(c) 30
(d) 35
Q.7 How many degrees of freedom are there in determining
the total variation?
(a) 24
(b) 28
(c) 34
(d) 35
In the above experiment having a single factor with five groups and seven values in
each group, if SSC = 60 and SST = 210:
Q.8 What is SSW?
(a) 270
(b) 150
(c) 350
(d) 285
Q.9 What is MSC?
(a) 5
(b) 10
(c) 15
(d) 20
Q.10 What is MSW?
(a) 5
(b) 10
(c) 15
(d) 20
Q.4 SST = 65; SSC = 20.13; SSB = 11.5; SSE = 33.38
Q.5 MSC = 6.71; MSB = 3.83; MSE = 3.71
Q.6 Null hypothesis H0: μ1 = μ2 = μ3 = μ4 and alternate hypothesis H1: Not all population means
are equal
Q.7 F_cal = 1.81; F_crit = 3.86. Since F_cal < F_crit, H0 is not rejected.
Q.12 It is concluded that all group means are equal, meaning all samples come from the
same population. In other words, there is no variation in sales potential across
different products keeping in consideration the natural variation between the
markets.
CHI-SQUARE TEST
STRUCTURE
11.0 Objectives
11.1 Introduction
11.2 Applications of Chi-Square Test
11.3 Requirement of Chi-Square Test
11.4 Chi-Square Test for the Difference between Proportions
11.5 Chi-Square Test of Independence
11.6 Chi-Square Test of Goodness of Fit
11.7 Let Us Sum Up
11.8 Key Words
11.9 References and Suggested Additional Readings
11.10 Self-Assessment Questions
11.11 Check Your Progress — Possible Answers
11.12 Answers to Self-Assessment Questions
11.0 OBJECTIVES
• hypothesis testing for the difference between two proportions using the
chi-square test
• hypothesis testing for the difference between more than two proportions using
the chi-square test
• hypothesis testing for independence between two categorical variables
UNIT 11
Chi-Square Test
11.1 INTRODUCTION
A super specialty hospital has four clinics: cardiology, nephrology, hepatology, and
pulmonology. A different number of patients, those who are overweight and those who are not
overweight, visit each of these clinics. The director of the hospital believes that
overweight patients are more likely to have some diseases as compared to other patients.
She thinks that being overweight is the reason for the difference in the number of patients
visiting various clinics for treatment. To examine her propositions, she collects data from a
sample of 450 patients who visited the hospital in a particular week and records the results in
the following contingency table:
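Table 11.1: Number of Patients
               | Cardiology | Nephrology | Hepatology | Pulmonology | Total
Overweight     | 70  | 77  | 57 | 89  | 293
Not Overweight | 35  | 43  | 38 | 41  | 157
Total          | 105 | 120 | 95 | 130 | 450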
She also has a feeling that the incidence of these diseases is related to the liquor
consumption frequency of patients. She records the data about liquor consumption
frequency and incidence of each of the diseases in the following contingency table:
Table 11.2: Number of Patients
In addition to the above information, there is a prevailing theory in the field of medicine that
the proportions of patients who are overweight and those who are not overweight in the
overall population are equal.
(1) Whether the proportions of patients having each type of the disease are different
because of their being overweight?
(2) Whether the type of disease and the liquor consumption frequency are related?
(3) Whether the observed categorical distribution of patients’ weights is the same as the
theoretical distribution of patients’ weight?
In previous units, we have learned the concept of hypothesis testing (one-sample
tests, Unit 8), the procedure for comparing the means of two populations and two proportions
(two-sample tests, Unit 9), and the procedure for comparing the means of more than two
populations (analysis of variance, Unit 10). None of these procedures will
help to test the medical director’s propositions about the incidence of various diseases
(measured by the number of patients visiting various disease clinics) based upon the weights
of patients, and about the relationship between disease and the liquor consumption
frequency, because we are required to test the difference among more than two
independent population proportions (cardiology, nephrology, hepatology, and pulmonology)
for a categorical variable with two levels (overweight and not overweight).
In this chapter we extend the concept of hypothesis testing to (a) analyze differences
between various population proportions (based on two or more samples); (b) test the
hypothesis of independence in the joint responses of two categorical variables; and (c)
test the goodness of fit of a distribution. We will learn to use chi-square to test the above
hypotheses and to examine the three propositions made by the medical director (in the
above-mentioned case of a super-specialty hospital). The hypothesis testing procedure uses a test
statistic that is approximated by a chi-square (χ²) distribution.
As discussed in the previous section, the chi-square test can be used in all situations where we
want to examine the difference between various population proportions for a categorical
variable of interest; the relationship between two categorical variables, each with many
levels; and the appropriateness of a distribution. An illustrative list of applications of the
chi-square test is given below:
• Comparison of difference between hotels for the proportion of guests who are
likely to return.
• To check the relationship between the primary reason for buying online and
customer satisfaction across various e-commerce websites.
Q.2 Can we use the chi-square test to determine whether the average summer
temperatures are different across northern, southern, eastern, and western parts of
India? (Yes/No)
Q.3 Results of hypothesis tests of difference between two samples by using the Z test and
χ² test will be different. (True/False)
The variables involved are categorical. The population proportions (or populations) are
different categories or levels. In Table 11.1, the type of disease, with four different levels, is a
categorical variable. The second variable, containing the items of interest (patients who
are overweight in Table 11.1), is also categorical (with two levels).
The count of items under the various categories of the variables of interest is recorded in either 2 × c or
r × c contingency tables. Table 11.1 is a 2 × 4 contingency table in which the row totals and the
column totals each sum to 450.
11.3.3 FREQUENCY
Observed Frequencies (fo): These are the frequencies that are observed in a sample (or
collected data). In other words, these are the counts of a sample distribution. Counts in Table
11.1 are observed frequencies.
Expected Frequencies (fe): These are the frequencies or counts that are expected in a
theoretical distribution. Expected frequencies can be known based on some theory or can be
calculated as proportions from the sample data. For example, a theoretical distribution of
500 tosses of a coin will contain 250 heads and 250 tails; similarly, a theoretical distribution
of results of a dice experiment will contain an equal number of each outcome: 1, 2, 3, 4, 5 and 6.
These counts are expected frequencies. The expected frequencies for data in Table 11.1
can be calculated as proportions of the total sample size.
Table 11.3
   | X  | Y   | Total
I  | 10 | 50  | 60
II | 30 | 110 | 140
Table 11.4
(Preference of Mobile Phones for Online Shopping by Employees)
Profession
Corporate | Government Total
Employees | Employees
Prefer Mobile Phone for Shopping 166 154 320
Hypothesis Testing
Null Hypothesis H0: π1 = π2 states that the two population proportions are equal, and Alternative
Hypothesis H1: π1 ≠ π2 states that the two population proportions are not equal.
The χ² test statistic is equal to the squared difference between the observed and expected
frequencies, divided by the expected frequency in each cell of the table, summed over all
cells of the table:
$$\chi^2 = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e}$$
We reject the null hypothesis if the calculated value of χ² is greater than the χ² distribution value
with 1 degree of freedom. A chi-square test with r rows and c columns has (r − 1) × (c − 1)
degrees of freedom. Since the test of difference between two populations has r = 2 and c = 2,
it has (2 − 1) × (2 − 1) = 1 × 1 = 1 degree of freedom. Rejecting the null hypothesis means that
the two proportions are different for the categorical variable of interest.
If the calculated value of the χ² statistic is less than or equal to the χ² distribution value with 1 degree
of freedom, we fail to reject the null hypothesis and conclude that there is no evidence that the two
proportions differ for the categorical variable of interest. It means that the sample proportions that we
compute for each of the groups would differ from each other only by chance. Each sample
would then provide an estimate of the common population parameter, π. A statistic that
combines these two separate estimates into one overall estimate of the population
parameter provides more information than either of the two estimates could provide by
itself.
Example 11.1: Sample data from 500 employees belonging to two different types of
professions (corporate and government) was collected to know whether these two groups of
employees were different in terms of their preference for use of mobile phones for online
shopping. Formulate the appropriate hypothesis, test it at a 5% level of significance, and
determine whether the two groups are different.
Profession
                                               | Corporate Employees | Government Employees | Total
Prefer Mobile Phone for Online Shopping        | 166 | 154 | 320
Don’t Prefer Mobile Phone for Online Shopping  | 68  | 112 | 180
Total                                          | 234 | 266 | 500
Solution: Since the objective is to determine whether two groups are different from each
other about their preference for using mobile phones for doing online shopping, the
condition entails the use of chi-square for testing the difference between two proportions
(corporate employees and government employees).
Null Hypothesis H0: π1 = π2 states that the two population proportions are equal, and
Alternative Hypothesis H1: π1 ≠ π2 states that the two population proportions are not equal
(different).
Table 11.6
Profession
Table 11.7
Profession
                                               | Corporate Employees | Government Employees | Total
Prefer Mobile Phone for Online Shopping        | (234 × 320)/500 | (266 × 320)/500 | 320
Don’t Prefer Mobile Phone for Online Shopping  | (234 × 180)/500 | (266 × 180)/500 | 180
Profession
                                               | Corporate Employees | Government Employees | Total
Prefer Mobile Phone for Online Shopping        | 149.76 | 170.24 | 320
Don’t Prefer Mobile Phone for Online Shopping  | 84.24  | 95.76  | 180
$$\chi^2 = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e} = 9.20$$
Since the calculated value of the χ² statistic (9.20) is greater than the table value of χ²
(3.841 with 1 degree of freedom at α = 0.05), we reject the null hypothesis
that there is no difference in preference for the mobile phone for online shopping between
the two groups of employees. This means the two proportions (groups of employees) are
significantly different. We conclude that the proportion of corporate employees who prefer
the mobile phone for online shopping is different from the proportion of government
employees who prefer mobile phones for online shopping.
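As a check on the reported statistic, the four cell contributions for Example 11.1 can be written out term by term. The "do not prefer" counts, 68 and 112, follow from the column totals 234 and 266 that appear in the expected-frequency formulas.

```latex
\chi^2 = \frac{(166-149.76)^2}{149.76} + \frac{(154-170.24)^2}{170.24}
       + \frac{(68-84.24)^2}{84.24} + \frac{(112-95.76)^2}{95.76}
       \approx 1.761 + 1.549 + 3.131 + 2.754 \approx 9.20
```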
11.4.2 CHI-SQUARE TEST FOR DIFFERENCE AMONG MORE THAN TWO PROPORTIONS
In this section, we extend the chi-square test to compare more than two independent
populations. Instead of two proportions, we have several proportions. We use
the letter c to represent the number of independent proportions. Thus, the contingency table
will have 2 rows and c columns.
Hypothesis Testing
Null Hypothesis H0: π1 = π2 = ... = πc states that there are no differences among the c
population proportions, and
Alternative Hypothesis H1: not all the c population proportions are equal.
The number of degrees of freedom will be (2 − 1) × (c − 1).
The rest of the procedure for calculating the expected frequencies and the χ² statistic, and the
decision criteria for rejecting or not rejecting the null hypothesis, are the same as in the case of
two proportions.
Example 11.2: A super specialty hospital has four clinics — cardiology, nephrology,
hepatology, and pulmonology. A different number of patients, those who are overweight
and those who are not overweight, visit each of these clinics. The director of the hospital
wants to examine whether being overweight is the reason for the difference in the number
of patients visiting various clinics for treatment. She collects data from a sample of 450
patients who visited the hospital in a particular week and records the results in the following
(2 × 4) contingency table:
Overweight 70 77 57 89 293
Null Hypothesis H0: π1 = π2 = π3 = π4 states that the proportions of patients visiting each of the
four clinics are equal, and
Alternative Hypothesis H1: not all the four proportions of patients are equal.
Table 11.11
Overweight 70 77 57 89 293
Table 11.12
Overweight     | (105 × 293)/450 | (120 × 293)/450 | (95 × 293)/450 | (130 × 293)/450 | 293
Not Overweight | (105 × 157)/450 | (120 × 157)/450 | (95 × 157)/450 | (130 × 157)/450 | 157
Σ (fo − fe)²/fe = 1.89
χ² = 1.89
The table χ² value with (2 − 1) × (4 − 1) = 3 degrees of freedom and α = 0.05 is 7.815.
Since the calculated χ² value (1.89) < the table χ² value (7.815), we fail to reject the null
hypothesis. Hence, we do not have sufficient evidence to say that the population
proportions are different from each other. We conclude that there is no evidence of a difference between
the proportions of patients having the four different diseases based on their being overweight or
not. It means that there is no evidence to conclude that patients who are overweight have a
higher proportion of a particular disease than the other disease(s). Therefore, the data show no
relation between weight and the four types of diseases (related to heart, kidney, liver, and lungs).
The difference in the incidence of diseases (measured by the number of patients) is a random
variation or chance variation that may be attributed to other factors but not to the weight of
patients.
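The calculation in Example 11.2 can be sketched in a few lines of code. This is a minimal illustration, not a prescribed method: the function name `chi_square_statistic` is hypothetical, and the "not overweight" counts are derived here from the column totals (105, 120, 95, 130) used in the expected-frequency table.

```python
# Chi-square test for the difference among c proportions (2 x c contingency table).
observed = [
    [70, 77, 57, 89],   # overweight
    [35, 43, 38, 41],   # not overweight (column total minus overweight count)
]


def chi_square_statistic(table):
    """Return the chi-square statistic, degrees of freedom and expected counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    expected = []
    for i, row in enumerate(table):
        expected_row = []
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n   # (row total x column total) / n
            expected_row.append(f_e)
            statistic += (f_o - f_e) ** 2 / f_e
        expected.append(expected_row)
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, expected


chi2, dof, expected = chi_square_statistic(observed)
critical = 7.815   # table value for 3 degrees of freedom at alpha = 0.05 (from the text)
print(f"chi-square = {chi2:.2f}, degrees of freedom = {dof}")
print("reject H0" if chi2 > critical else "fail to reject H0")
```

The same function works for any r × c table, because the expected count of each cell depends only on its row total, its column total, and the sample size.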
The chi-square test has been used as a test of the relationship between two categorical variables
to conclude that there is no relationship between the type of disease and the weight of
patients. In other words, these two categorical variables are independent. The following
section covers the applicability of chi-square as a test of independence.
Chi-square is also used as a test of independence between two categorical variables. In a test
of the difference between proportions, there is one factor of interest with two or more levels
(representing independent proportions) and a categorical variable with two levels
(representing items of interest and items not of interest). In a test of independence, the
second categorical variable can also have more than two levels. Thus, a test of independence
has two factors of interest, each with two or more levels. The contingency table has r rows
and c columns.
Hypothesis Testing
Solution:
Step 1: Hypothesis Formulation
fo   fe      (fo − fe)²/fe
25   11.67   15.24
40   23.33   11.90
22   29.17   1.76
18   40.83   12.77
10   13.33   0.83
27   26.67   0.00
47   33.33   5.60
36   46.67   2.44
7    10.56   1.20
14   21.11   2.40
39   26.39   6.03
35   36.94   0.10
8    14.44   2.88
19   28.89   3.39
17   36.11   10.11
86   50.56   24.85
Σ (fo − fe)²/fe = 101.50
χ² = 101.50
The table χ² value with (4 − 1) × (4 − 1) = 9 degrees of freedom and α = 0.05 is 16.919.
Since the calculated χ² value (101.50) > the table χ² value (16.919), we reject the null hypothesis
(that the type of disease is independent of the liquor consumption frequency of patients). We
conclude that there is a relationship between the two categorical variables, the type of disease
and the liquor consumption frequency of patients, and that the type of disease depends on
the liquor consumption frequency of patients. The variation in the number of patients
between different types of diseases is not just by chance but is attributed to liquor
consumption.
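The solution table for the test of independence can be rebuilt from the fo column alone, which also shows which cell drives the result. This sketch assumes the sixteen observed counts are listed as four consecutive blocks of four cells; which block corresponds to which clinic or consumption level is not recoverable from the table, so the layout and the variable names are assumptions.

```python
# Observed counts in the order they appear in the fo column of the solution.
f_o = [25, 40, 22, 18, 10, 27, 47, 36, 7, 14, 39, 35, 8, 19, 17, 86]
blocks = [f_o[i:i + 4] for i in range(0, 16, 4)]       # assumed 4 x 4 layout

block_totals = [sum(block) for block in blocks]         # totals of each block
level_totals = [sum(level) for level in zip(*blocks)]   # totals across blocks
n = sum(f_o)

# For every cell keep (observed, expected, contribution to chi-square).
contributions = []
for i, block in enumerate(blocks):
    for j, observed in enumerate(block):
        expected = block_totals[i] * level_totals[j] / n
        contributions.append((observed, expected, (observed - expected) ** 2 / expected))

chi2 = sum(term for _, _, term in contributions)
dof = (len(blocks) - 1) * (len(blocks[0]) - 1)
largest = max(contributions, key=lambda cell: cell[2])

print(f"chi-square = {chi2:.2f} with {dof} degrees of freedom")
print(f"largest contribution: fo = {largest[0]}, fe = {largest[1]:.2f}, term = {largest[2]:.2f}")
```

Listing the per-cell terms is a simple way to see where the observed counts depart most from independence, although it is not a formal follow-up test.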
With the chi-square test of the difference between proportions, we can only conclude that
proportions are not equal but cannot conclude which proportions differ. To determine which
proportions differ, a multiple comparisons procedure like the Marascuilo procedure, which is
out of scope in the present discussion, can be used.
The following contingency table provides results of a survey conducted for the primary
reason for shopping online for customers of three popular e-commerce platforms:
Table 11.16: Primary Reason for Shopping Online
Q.2 Can we test the hypothesis of difference among the proportion of customers of three
e-commerce platforms based on their primary reason for shopping by using the chi-
square test? Give the reason in support of your argument.
Q.3 Can we test the hypothesis of the relationship between the primary reason for
shopping and the preferred e-commerce platform?
In addition to the test of hypothesis about the difference in population proportions and
independence between two categorical variables, chi-square can also be used to decide
In the case of the super-specialty hospital, the medical director wants to examine whether the
prevailing theory (that the proportions of patients who are overweight and those who are not
overweight are equal) holds true or not when compared with the observed distribution of
the weight of patients. If the observed frequencies are close enough to the expected
frequencies we can conclude that there is no significant difference between the theoretical
and observed distributions. Otherwise, we may cast doubts about the prevailing theory that
the probability of both types of patients (overweight and not overweight) is the same. We
use the procedure of chi-square discussed in previous sections to accept or reject our null
hypothesis.
Distribution of Weight of Patients
               | Number of Patients
Overweight     | 293
Not Overweight | 157
Total          | 450
Hypothesis
Null Hypothesis H0: π1 = π2 states that the two population proportions are equal, and
Alternative Hypothesis H1: π1 ≠ π2 states that the two population proportions are not equal.
We will test the above-stated hypotheses at a 5% level of significance.
               | fo  | fe  | (fo − fe)²/fe
Overweight     | 293 | 225 | 20.55
Not Overweight | 157 | 225 | 20.55
χ² = 41.102
The table χ² value with 1 degree of freedom at α = 0.05 is 3.841.
Since the calculated χ² value (41.102) > the table χ² value (3.841), we reject the theoretical
proposition that the two types of patients (overweight and not overweight) are equal in number.
The present sample does not provide any evidence in support of the theory.
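The statistic above can be verified by hand. Under the prevailing theory each category is expected to hold half of the 450 patients, so the expected frequency is 225 in both cells:

```latex
\chi^2 = \frac{(293-225)^2}{225} + \frac{(157-225)^2}{225}
       = \frac{4624}{225} + \frac{4624}{225}
       \approx 20.551 + 20.551 = 41.102
```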
Example 11.4: A salesman has five accounts to visit per day. It is assumed that the variable
(sales) follows a binomial distribution with the probability of selling each account being 0.4.
Given the following distribution of sales per day, can it be concluded that the data of sales
follow a binomial distribution at a 5% level of significance?
Table 11.6
Number of Sales per Day | 0  | 1  | 2  | 3  | 4 | 5
Frequency               | 10 | 41 | 60 | 20 | 6 | 3
Solution:
Step 1: Hypothesis Formulation
Null Hypothesis H,: A binomial distribution with a probability of 0.4 is a good description of
sales, and
Alternative Hypothesis H,: A binomial distribution with a probability of 0.4 is not a good
description of sales.
Degrees of freedom = Number of categories—1=6-1=5
Significance Level: 0.05
$P(X = x) = \dfrac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x}$
where p = probability of a sale = 0.4, n = 5, x = the number of sales, and N = total
number of observations = 10 + 41 + 60 + 20 + 6 + 3 = 140.
x fo P(X) fe = N x P(X)
0 10 0.07776 10.8864
1 41 0.2592 36.288
2 60 0.3456 48.384
3 20 0.2304 32.256
4 6 0.0768 10.752
5 3 0.01024 1.4336
N = Σfo = 140    ΣP(X) = 1    Σfe = 140
fo   fe        (fo − fe)²/fe
10   10.8864   0.072173
41   36.288    0.611854
60   48.384    2.788762
20   32.256    4.656794
6    10.752    2.100214
3    1.4336    1.711502
Σ (fo − fe)²/fe = 11.9413
χ² = 11.9413
The table χ² value with 5 degrees of freedom and α = 0.05 is 11.071.
Since the calculated χ² value (11.9413) > the table χ² value (11.071), the observed frequencies are
too far away from the expected frequencies to conclude that the variable, sales, follows a
binomial distribution. Therefore, we reject the null hypothesis.
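The steps of Example 11.4 can be sketched with the standard library alone. The variable names below are illustrative; the probabilities come from the binomial formula with n = 5 and p = 0.4, and the critical value is the table value quoted in the solution.

```python
from math import comb

n_trials = 5                          # accounts visited per day
p = 0.4                               # hypothesized probability of a sale per account
observed = [10, 41, 60, 20, 6, 3]     # days with 0, 1, ..., 5 sales
n_days = sum(observed)                # total number of observations

# Binomial probabilities P(X = x) and expected frequencies N * P(X = x).
probabilities = [
    comb(n_trials, x) * p ** x * (1 - p) ** (n_trials - x)
    for x in range(n_trials + 1)
]
expected = [n_days * prob for prob in probabilities]

chi2 = sum((f_o - f_e) ** 2 / f_e for f_o, f_e in zip(observed, expected))
dof = len(observed) - 1               # number of categories minus one, as in the text
critical = 11.071                     # table value for 5 df at alpha = 0.05 (from the text)

print(f"chi-square = {chi2:.4f}, degrees of freedom = {dof}")
print("reject H0" if chi2 > critical else "fail to reject H0")
```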
Q.1 From a sample survey of examination results of 500 students, it was found that 220
had failed, 170 had secured third class, 90 were placed in second class and 20 got first
class. Are these figures commensurate with the general examination result which is in
the ratio 4:3:2:1 for the various categories, respectively at a 5% level of significance?
11.7 LET US SUM UP
When there are more than two populations (or proportions) hypothesis testing for the
difference between proportions is not possible by the procedures learned in previous units.
To test hypotheses under such situations along with testing the hypotheses of independence
(or relationship) between two categorical variables, and testing the goodness of fit of a
distribution we use the chi-square test. The hypothesis testing procedure uses a test statistic
that is approximated by a chi-square (χ²) distribution.
To conduct a chi-square test, sample data consisting of observed frequencies of categorical
variables are recorded in a contingency table, and expected frequencies are calculated based
on the proportions of items of interest. A chi-square statistic then compares the observed
frequencies against the expected frequencies to check whether the two sets of frequencies are
far enough apart to reject the null hypothesis. If there is no significant difference
between the two sets of frequencies, the null hypothesis of no difference between population
proportions holds; otherwise it does not. This requires comparing the calculated χ² value with the χ²
value derived from a table for a particular number of degrees of freedom and a particular level of
significance. If the calculated χ² value is less than or equal to the critical χ² value, we fail to reject
the null hypothesis; otherwise we reject it.
The same procedure is followed to test the difference between two proportions or between
more than two proportions, to test the relationship between two categorical variables with
any number of levels, and to test the goodness of fit of a theoretical distribution. This way, we will
be able to decide how far our assumptions about a distribution hold before we make
decisions based on such assumptions.
11.8 KEYWORDS
Contingency Table : A table used to record joint responses of categorical variables such that
the totals of row-counts and column-counts are equal.
Levine, David M., Stephan, David F., and Szabat, Kathryn A., 2016, Statistics for Managers
Using Microsoft Excel, 7th edition, Pearson India Education Services Pvt. Ltd., Noida.
Levine, Richard L., Rubin Davis S., Rastogi, Sanjay, and Siddiqui Masood Husain, 2013,
Statistics for Management, 7th edition, Pearson Education Inc., Noida.
Shrivastava, T.N. and Rego Shailaja., 2008, Statistics for Management, Tata McGraw-Hill,
New Delhi.
Gupta, S.P. and Gupta M.P., 2010, Business Statistics, 16th edition, Sultan Chand & Sons, New
Delhi.
Q.2 How many degrees of freedom will be required to calculate the table χ² value for a
distribution of sample size n with r rows and c columns?
(a) n − (r × c)
(b) (r − 1) × (c − 1)
(c) r × c
(d) (r − 1) + (c − 1)
Q.3 Which of the following is not required to test a chi-square test hypothesis?
Q4 The minimum value of the expected frequency of every cell to apply the chi-square
test should be:
(a) 5
(b) 10
(c) 1
Table 11.17
Geographical Region
Table 11.18
Q.1 Yes
Q.2 No
Q.3 False
CHECK YOUR PROGRESS - II
Q.1
   | X  | Y   | Total
I  | 12 | 48  | 60
II | 28 | 112 | 140
Q1 6
Q.2 No, because a chi-square test for testing the difference between proportions can be
applied if the variable of interest has only two levels (one of interest and the other not
of interest). In this case we have four primary reasons for shopping.
Q.3 Yes
0.4
             | fo  | fe  | (fo − fe)²/fe
Failed       | 220 | 200 | 2
Third Class  | 170 | 150 | 2.67
Second Class | 90  | 100 | 1
First Class  | 20  | 50  | 18
χ² = 23.67
The table χ² value with 3 degrees of freedom and α = 0.05 is 7.815.
Since the calculated χ² value (23.67) > the table χ² value (7.815), the observed frequencies are
significantly different from the expected frequencies. Therefore, we reject the null
hypothesis that results are in the ratio 4:3:2:1. We conclude that results are different from
the general ratio of 4:3:2:1 across various categories.
Q1 (c)
Q.2 (b)
Q.3 (d)
Q.4 (a)
Q.5 Hypothesis
If π1, π2, π3, and π4 represent consumer proportions of four geographical regions, the
corresponding hypothesis is,
Geographical Region
fo   fe   (fo − fe)²/fe
130  140  0.71
χ² = 4.69
At α = 0.10 with 3 degrees of freedom, the table value χ² = 6.251.
Since the calculated χ² value (4.69) < the table χ² value (6.251), we fail to reject the null hypothesis;
there is no evidence that the brand share differs across the four geographical locations.
Alternative Hypothesis H,; Number of customer arrivals per hour doesn’t follow a Poisson
distribution.
Q.9 Expected Frequencies
1 57 0.15 59.76
2 98 0.22 89.64
3 85 0.22 89.64
4 78 0.17 67.23
20 19.92 0.00
57 59.76 0.13
98 89.64 0.78
85 89.64 0.24
78 67.23 1.73
62 73.79 1.88
χ² = 4.76
At α = 0.10 with 5 degrees of freedom, the table value χ² = 9.236.
Since the calculated χ² value (4.76) < the table χ² value (9.236), the null hypothesis that the number
of customer arrivals per hour follows a Poisson distribution is not rejected at the 10% level of
significance.
SIMPLE LINEAR REGRESSION
STRUCTURE
12.0 Objectives
12.1 Introduction
12.2 Application of Regression in Business and Management
12.3 Types of Relationship
12.4 Measuring Simple Relationships
12.5 Simple Linear Regression
12.6 Let Us Sum Up
12.7 Key Words
12.8 References and Suggested Additional Readings
12.9 Self-Assessment Questions
12.10 Check Your Progress - Possible Answers
12.11 Answers to Self-Assessment Questions
12.0 OBJECTIVES
After reading this unit, you will be able to learn:
• types of relationships
• meaning of correlation and various methods of measuring the correlation
between variables
• meaning of regression
• estimating the value of a dependent variable from the value of an independent
variable
12.1 INTRODUCTION
The word relation or relationship is commonly used to describe events and situations
experienced by us in our day to day lives. Mothers inculcate the habit of cleanliness among
their children to keep them healthy. You have grown up listening that hard work leads to
success. The teacher of a management institute helps in improving the communication skills
of her students for their successful career in management. Astrologers relate the events in
an individual's life with the position of stars. There are countless situations like these where
we believe that one event leads to another event because there is a relationship
between the events. When we have quantifiable information about events, we can check
whether there is a relationship between the events or not. The technique used to study the
relationship between events (variables) is called correlation.
Let’s understand this by taking an example of the demand for ice-cream - related to
maximum temperature during the day.
Example 12.1: If you visit India Gate, the historic monument that commemorates soldiers
who lost their lives in World War I, one thing that cannot escape one’s attention is ice-cream
pushcarts. India Gate is the country’s single largest marketplace for selling ice-cream and
accounts for almost 11% of overall ice-cream sales in the capital. It is also a favourite ground
for ice-cream manufacturers to test their products and offerings. Estimated sales of
ice-cream at India Gate on peak summer weekends are Rs. 18-20 lakhs per day, suggesting a
relationship between the demand for ice-cream and temperature during the day.
As monsoons are approaching, an ice-cream manufacturer wants to predict its demand for
the coming weekends by using the data of the quantity of ice-cream sold and maximum
temperature readings on weekends. The manufacturer records the following data of
quantity sold and maximum temperature for various weekend days of June.
Table 12.1
Day | June 6 | June 7 | June 13 | June 14 | June 20 | June 21 | June 27 | June 28
The maximum temperature of the day (measured in degrees Celsius) and sales (measured in
hundreds of kilograms) are quantifiable variables. If there is some evidence of a
relationship between these variables, that relationship can be studied by
using the principles of correlation.
Variables do not always relate to each other in the same manner; relationships differ
both in direction and in strength. Correlation helps in determining
the direction and strength of the relationship between variables. Ingrained in the spirit of
relationship is the spirit of dependence among variables. The quantity of weight lost by
participating in a weight loss program depends upon the number of calories consumed. The
quantity of food consumed depends upon its taste. In example 12.1, the quantity of ice-
cream sold depends upon temperature during the day. There are innumerable relationships
where some variables are dependent, and others are independent. When it is possible to
identify dependent and independent variables in a relationship, it is also possible to
establish a relationship between them mathematically and to predict the value of the
dependent variable(s) if independent variables are known. The technique used to represent
the relationship between dependent and independent variables and to predict the amount
of change in the dependent variable by changing the value of independent variables is called
regression. While correlation is concerned with measuring the relationship between
variables, regression is used to predict the variation in the value of the dependent variable
for variation in the value of the independent variable. Ice cream manufacturers can use the
concept of regression analysis to predict the demand (sales in kg) of ice cream based on
information about maximum temperature during the day.
Irrespective of the field of study, regression can be used wherever there is a
relationship between variables and it is possible to identify which of them are
dependent (predicted) and which are independent (predictor). Hence, its applicability is
universal. Its application in the fields of economics, business, and management can be
visualized in the following examples:
• Relationship between advertising expense and sales of a product.
• Relationship between discounts offered and units sold by a retail outlet.
• Relationship between footfall size and rental rate of shops in a shopping mall.
• Relationship between the speed of the conveyor belt and the number of defective
items produced on a shop floor.
• Relationship between brand equity and equity price of companies.
In a simple relationship, only two variables (one dependent and one independent) are
involved. The relationship between two variables is measured by using simple correlation,
and the value of the response variable (dependent variable) is predicted by using simple
regression. The base price (response variable) of a player in an IPL auction depends on his
average ratings (explanatory variable) during the year and can be predicted by using a simple
regression equation (regression model).
The relationship between more than two variables (one dependent and more than one
independent) is measured by using multiple correlation, and the value of the response
variable (dependent variable) is predicted by using multiple regression. The number of
units sold of a product (response variable) depends upon advertising expense, personal
selling expense and sales promotion expense (all explanatory variables). The number of
units sold, with more than one explanatory variable, can be predicted by conducting
multiple regression analysis.
When the direction of change in the values of variables involved in the relationship is the
same (i.e. when dependent and independent variables either increase or decrease
simultaneously) the relationship is positive (or direct). When the direction of change in the
values of variables is the opposite (i.e. when one variable increases and the other decreases
or vice versa) the relationship is negative (or indirect). Brand equity and product quality have
a positive correlation, whereas the price and demand are negatively correlated.
Q.3 Correlation analysis will not be enough to check the direction and strength of the
relationship between the premium paid and annual income. (True / False)
A scatter plot (or scatter diagram) is used to examine the relationship between an X variable
(independent or predictor) on the horizontal axis and a Y variable (dependent or
response) on the vertical axis. The nature of the relationship can take different forms. The
following section provides a snapshot of various types of relationships observed in
scatter plots.
Fig. 12.1 [Scatter-plot panels illustrating types of relationship; surviving panel titles: Panel G: No relationship, Panel H: No relationship]
Panels E shows that the correlation (relationship) between X and Y is strong as points are
closely scattered around a straight line. However, one relationship (in the panel E) is positive
and the other is negative.
Panel F suggests weak relationships because the points are away from the straight line.
Panels G and H show no correlation (no linear relationship) between X and Y. In panel G, the
scatter of points is not around a line or a curve. In panel H, only X changes and Y is constant,
suggesting that Y does not change with X.
The smaller the distance of points from a line (or a curve) in a scatter plot, the stronger the
relationship; the greater the distance of points from a line (or a curve), the weaker the
relationship.
Illustration: Scatter plot for the data in Table 12.1 [figure; x-axis: Maximum Temperature (Celsius), ticks 34 to 44]
Table 12.2
Q.1 Plot a scatter diagram between the two variables and state:
(a) Whether the relationship between them is direct (positive) or indirect
(negative).
$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \, \sum (Y_i - \bar{Y})^2}}$$
Example 12.1:
Illustration: Calculation of Karl Pearson's Coefficient of Correlation (r) for the data.
$$r = \frac{40.38}{\sqrt{(38.88)(75.88)}} = 0.743409$$
r = 0.743409 indicates a strong correlation between ice-cream sales and temperature. The
positive value of r indicates that as temperature increases, sales of ice-cream also increase.
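The calculation of r can be reproduced in a few lines. The sketch below is illustrative only: the data rows of Table 12.1 did not survive in this copy, so the eight (temperature, sales) pairs are an assumption. Seven pairs are read from the residual table later in this unit; the pair (38, 50) and the sales value 48 are inferred so that the data reproduce the quoted totals ΣX = 319 and ΣY = 395 and the sums of squares 38.88 and 75.88. The function name `pearson_r` is hypothetical.

```python
from math import sqrt

# ASSUMPTION: data reconstructed from the residual table and the quoted totals
# (sum of X = 319, sum of Y = 395); the original Table 12.1 rows are missing.
temperature = [38, 36, 42, 41, 40, 38, 41, 43]   # maximum temperature, degrees Celsius
sales = [50, 44, 49, 48, 52, 47, 50, 55]         # sales, hundreds of kg

def pearson_r(x, y):
    """Karl Pearson's coefficient of correlation from raw data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))  # cross-product sum
    sxx = sum((a - mean_x) ** 2 for a in x)                       # squared deviations of X
    syy = sum((b - mean_y) ** 2 for b in y)                       # squared deviations of Y
    return sxy / sqrt(sxx * syy)

r = pearson_r(temperature, sales)
print(round(r, 6))
```

With these assumed data the intermediate sums are 40.375, 38.875 and 75.875, which round to the figures used in the text.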
Example 12.2:
The average maximum temperature (city-wise ranking) for June weekends for ten cities of
India and the average sales of ice cream (city-wise ranking) is recorded in table 12.3 below:
Table 12.3
City | Temperature rank | Sales rank
Amritsar | 2 | 1
Bhopal | 3 | 4
Mysore | 9 | 8
Nagpur | 4 | 3
Patna | 5 | 6
Ranchi | 7 | 7
Srinagar | 8 | 10
Vijayawada | 10 | 9
Vishakhapatnam | 6 | 5
Solution 12.2: Difference of Ranks (D) and other calculations are shown below:
City | Temperature rank | Sales rank | D | D²
Amritsar | 2 | 1 | 1 | 1
Bhopal | 3 | 4 | -1 | 1
Mysore | 9 | 8 | 1 | 1
Nagpur | 4 | 3 | 1 | 1
Patna | 5 | 6 | -1 | 1
Ranchi | 7 | 7 | 0 | 0
Srinagar | 8 | 10 | -2 | 4
Vijayawada | 10 | 9 | 1 | 1
Vishakhapatnam | 6 | 5 | 1 | 1
n = 10, ΣD² = 12
$$R = 1 - \frac{6\sum D^2}{n^3 - n} = 1 - \frac{6(12)}{10^3 - 10} = 0.927273$$
R = 0.927273 indicates a high correlation between ice-cream sales and temperature. The
positive value of R indicates that as temperature increases, sales of ice-cream also increase;
hence the correlation is positive.
Sometimes, when ranks are not provided, we can obtain the ranks from the given values of
variables X and Y to calculate the value of Spearman's coefficient R.
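Spearman's coefficient can be checked the same way. In this hedged sketch, nine rank pairs come from Table 12.3 as printed; the tenth pair (1, 2) is an assumption, inferred because n = 10 and ΣD² = 12 are stated and ranks 1 (temperature) and 2 (sales) are the only ones unused. The city name for that row is not recoverable, and the function name `spearman_r` is hypothetical. The shortcut formula used here applies when there are no tied ranks.

```python
# (temperature rank, sales rank) pairs.
# ASSUMPTION: the first pair (1, 2) is inferred from n = 10 and sum of D^2 = 12;
# the remaining nine pairs are read from Table 12.3.
rank_pairs = [(1, 2), (2, 1), (3, 4), (9, 8), (4, 3),
              (5, 6), (7, 7), (8, 10), (10, 9), (6, 5)]

def spearman_r(pairs):
    """Spearman's rank correlation, R = 1 - 6*sum(D^2) / (n^3 - n), no tied ranks."""
    n = len(pairs)
    sum_d_squared = sum((a - b) ** 2 for a, b in pairs)
    return 1 - 6 * sum_d_squared / (n ** 3 - n)

sum_d_squared = sum((a - b) ** 2 for a, b in rank_pairs)
R = spearman_r(rank_pairs)
print(sum_d_squared, round(R, 6))
```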
Note: Values of Karl Pearson’s Coefficient of Correlation (r) and Spearman’s Correlation
Coefficient (R) calculated for the same data may not be the same.
Also, which one of the above two is a better reflection of the relationship between the
variables?
A variable whose value is predicted is called a dependent variable because its value changes
in response to changes in the value of the independent variable(s). The dependent variable is
also called the explained variable, response variable, output variable, predicted variable, or regressand.
Q.2 It is meaningless to decide which of the above variables are dependent and
independent to predict the premium amount. (True/False)
Regression analysis is the process of constructing a mathematical model or function that can
be used to predict or determine one variable from another variable. It involves the construction
of a linear equation, called a regression equation, that describes the relationship between
the dependent and independent variables.
The general regression model for population data is given below:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
The terms used in the model are shown below:
Yᵢ: dependent variable
β₀: population Y intercept
β₁: population slope coefficient
Xᵢ: independent variable
εᵢ: random error term
In the above model, the Yᵢ = β₀ + β₁Xᵢ portion is a straight line. The slope of the line, β₁,
represents the expected change in Y per unit change in X. It represents the mean amount that Y
changes (either positively or negatively) for a one-unit change in X. The Y-intercept, β₀,
represents the mean value of Y when X equals 0.
The last component of the model, εᵢ, represents the random error in Y for each observation,
i. In other words, εᵢ is the vertical distance of the actual value of Yᵢ above or below the
expected value of Y on the line.
Fig. 12.2, below, provides the model summary for population values of X (independent
variable) and Y (dependent variable).
Fig. 12.2 [Population regression line Yᵢ = β₀ + β₁Xᵢ + εᵢ, marking the observed value of Y for Xᵢ and the predicted value of Y for Xᵢ]
12.5.2.1 SIMPLE LINEAR REGRESSION EQUATION
In the preceding section, we discussed the regression model that describes the relationship
between variables for the population under consideration. However, practically, the data are
collected from a sample. If certain assumptions are valid, we can use the sample Y intercept,
b₀, and the sample slope, b₁, as estimates of the respective population parameters, β₀ and β₁.
The equation below uses these estimates to form a simple linear regression equation.
$$\hat{Y}_i = b_0 + b_1 X_i$$   (Equation 12.4)
The straight line constructed using sample data (eq. 12.4) is often referred to as the
prediction line. The following terms are used in the above model constructed from sample
values of X and Y.
Estimated
least squares.
Because this equation has two unknowns, b₀ and b₁, the sum of squared differences depends
on these two unknowns. The least-squares method mathematically determines the values of
b₀ and b₁ that minimize the sum of squared differences around the prediction line. Any
values of b₀ and b₁ other than those determined by the least-squares method result in a
greater sum of squared differences between the actual (Yᵢ) and predicted (Ŷᵢ) values. The
process of regression model construction involves calculating the values of the regression
coefficients b₀ and b₁. Substituting the values of b₀ and b₁ in the above equation provides the
resultant regression model that is used to analyze the relationship and predict the values of
the dependent variable.
The least-squares method provides the following equations to calculate the values of b₁
(slope of the line) and b₀ (intercept):
$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$   (Equation 12.5)
or
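Σ(Xᵢ − X̄)(Yᵢ − Ȳ) = 40.38, Σ(Xᵢ − X̄)² = 38.88
ΣX = 319, ΣY = 395, n = 8, so X̄ = 39.875 and Ȳ = 49.375
b₁ = 40.38 / 38.88 = 1.038585 (computed from the unrounded sums)
b₀ = Ȳ − b₁X̄ = 49.375 − (1.038585)(39.875) = 7.961415
Ŷᵢ = 7.961415 + 1.038585 Xᵢ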
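The least-squares coefficients can be recomputed from raw data as a check. This is a minimal sketch under an assumption: the rows of Table 12.1 are missing in this copy, so seven (temperature, sales) pairs are taken from the residual table of this unit, and the pair (38, 50) and the sales value 48 are inferred so that ΣX = 319 and ΣY = 395. The helper name `fit_line` is hypothetical.

```python
# ASSUMPTION: data reconstructed from the residual table and the quoted totals;
# the original Table 12.1 rows are missing in this copy.
temperature = [38, 36, 42, 41, 40, 38, 41, 43]   # X, degrees Celsius
sales = [50, 44, 49, 48, 52, 47, 50, 55]         # Y, hundreds of kg

def fit_line(x, y):
    """Return (b0, b1) of the least-squares prediction line Y-hat = b0 + b1 * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx               # slope (Equation 12.5)
    b0 = mean_y - b1 * mean_x    # intercept: the line passes through (mean_x, mean_y)
    return b0, b1

b0, b1 = fit_line(temperature, sales)
print(round(b0, 6), round(b1, 6))
```

The intercept is obtained from the slope rather than fitted separately, which reflects the fact that a least-squares line always passes through the point of means.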
Interpreting the values of slope (b₁) and intercept (b₀)
In the above regression model (prediction line), the slope b₁ means that sales (predicted value of
the dependent variable, Y) are estimated to increase by 103.8585 kg (or 1.038585 × 100 kg) for
each one-degree increase in temperature (independent variable, X). That also means that
for each one-degree decrease in temperature, sales of the manufacturer would decrease by
103.8585 kg. A negative value of slope b₁ would suggest that as X increases, Y decreases and
vice versa. Thus, the slope represents the portion of sales that is estimated to vary
according to the temperature.
The Y-intercept, b₀, represents the predicted value of Y when X = 0. But the value of b₀ should be
interpreted cautiously according to the type of variable under consideration. In our
example, temperature (X) is interval data, where X = 0 does not mean an absence of
temperature altogether. Since X = 0 does not mean "no temperature" in this case, the value b₀ =
7.961415 (i.e. 796.1415 kg of sales when X = 0) may not be meaningful.
Example 12.3: A consumer research firm wants to predict the weekly expense of consumers
based on their weekly income. Based on the data collected from consumers, the regression
model Ŷ = 250 + 200 X is obtained, where X is weekly income in thousands of rupees. The slope
value (b₁ = 200) means that for every Rs 1,000 earned, spending increases by Rs 200. A consumer
who earns Rs 10,000 per week is estimated to spend Rs 2,250 (250 + 200 × 10). The intercept value
(b₀ = 250) indicates that when the consumer does not earn (X = 0), his weekly expense is predicted
to be Rs 250. In this example the value of b₀, unlike in the previous example, is meaningful.
Slope: Rate of change in the value of the dependent variable with respect to the independent
variable, or the change observed in the value of the dependent variable when the independent
variable changes by one unit.
Intercept: Value of the dependent variable when the value of the independent variable is
zero.
Let's use the prediction line Ŷ = 7.961415 + 1.038585 X to predict the demand for ice-cream
for a particular Sunday having a forecasted maximum temperature of 37 degrees Celsius.
The value of Ŷ for X = 37 will be 46.38906. Therefore, the demand for ice-cream on a day with
a maximum temperature of 37 degrees Celsius is estimated to be 4638.906 kg.
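The prediction step is a direct substitution into the prediction line. The sketch below also flags whether the input lies inside the observed temperature range, since predictions outside that range are extrapolations. The range 36 to 43 degrees is an assumption read from the X values that survive in the residual table of this unit, and the function name `predict_sales` is hypothetical.

```python
B0 = 7.961415   # intercept of the prediction line from the text
B1 = 1.038585   # slope of the prediction line from the text

# ASSUMPTION: observed temperature range, read from the X values in the residual table.
X_MIN, X_MAX = 36, 43

def predict_sales(temp_celsius):
    """Return (predicted sales in hundreds of kg, True if inside the observed X range)."""
    y_hat = B0 + B1 * temp_celsius
    inside_range = X_MIN <= temp_celsius <= X_MAX
    return y_hat, inside_range

y_hat_37, is_interpolation = predict_sales(37)
print(round(y_hat_37, 5), round(y_hat_37 * 100, 3), is_interpolation)
```

A call such as `predict_sales(25)` still returns a number, but the second element is `False`, signalling that the line is being used beyond the data it was fitted on.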
Q.2 Predicting the values of the dependent variable by extrapolation using regression
analysis provides accurate results. (True/False).
12.5.1.5 RESIDUALS
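Let's calculate the predicted values of Y by using the regression model, Ŷ = 7.961415 +
1.038585 X, developed for Example 12.1.
X (temperature) | Y (observed sales) | Ŷ (predicted sales) | Residual (Y − Ŷ)
36 | 44 | 45.35 | -1.35
42 | 49 | 51.582 | -2.58
41 | 48 | 50.543 | -2.54
40 | 52 | 49.505 | 2.495
38 | 47 | 47.428 | -0.43
41 | 50 | 50.543 | -0.54
43 | 55 | 52.621 | 2.379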
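The residual column can be regenerated from the prediction line. This sketch uses only the seven (X, Y) pairs listed in the table above and the coefficients quoted in the text; the observed value 48 in the third pair is an assumption, chosen because it is the value consistent with the printed residual of 2.54 in magnitude.

```python
B0 = 7.961415   # intercept from the text
B1 = 1.038585   # slope from the text

# (temperature, observed sales) pairs as listed in the residual table.
# ASSUMPTION: the third pair uses Y = 48, which matches the printed residual.
observations = [(36, 44), (42, 49), (41, 48), (40, 52), (38, 47), (41, 50), (43, 55)]

predicted = [B0 + B1 * x for x, _ in observations]
residuals = [y - y_hat for (_, y), y_hat in zip(observations, predicted)]

for (x, y), y_hat, e in zip(observations, predicted, residuals):
    # residual = observed minus predicted; negative means the point lies below the line
    print(x, y, round(y_hat, 3), round(e, 3))
```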
Fig. 12.3 [Scatter plot of the data with the fitted regression line y = 1.0386x + 7.9614; x-axis ticks 36 to 44, y-axis ticks 10 to 60]
The regression equation developed for Example 12.1 is shown in Fig. 12.3. As the graph
shows, the values are scattered around the regression line. The vertical distances
between the observed values of Y and the predicted values of Y (estimated on the prediction
line) are residuals. As discussed in the previous section, the method of least squares tries to
minimize these distances in aggregate. It calculates the values of b₀ and b₁ that give the
regression line minimizing the squared vertical distances of all the data points. Even though the
data values do not lie on the line, this line is still the line of best fit for all points.
Any regression model explains some proportion of the variation in the value of Y (caused by X) and
at the same time fails to explain some proportion of the variation in the value of Y. The unexplained
part will typically come from the points far from the regression line. However, if such points are few
in number and occur at random, the model can still be used. The portion of the value of Y which remains
unexplained by the regression model is the error term (ε).
Q.1 Calculate the predicted values of premium amount using the observed values of
annual income (given in Table 12.2). Also calculate the residuals.
Total Sum of Squares (SST) is the variation of Y values around their mean, Ȳ.
$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$   (Equation 12.8)
SST is divided into two parts, explained variation and unexplained variation, where Y values
are the observed ones.
SST=SSR+SSE
Error Sum of Squares (SSE) also called error variation is the variation in the value of Y due to
factors other than the relationship between X and Y. It represents that part of the variation in
The ratio of SSR to SST measures the proportion of variation in Y that is explained by the
independent variable X. This ratio is called the coefficient of determination, r².
The standard error of the estimate measures the variability of the observed values of Y around
the predicted values in the same way as the standard deviation shows the variability of original
values around their mean. In other words, the standard error of the estimate, S_YX, is the standard
deviation of the regression model, calculated as below:
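The displayed formulas did not survive in this copy. The usual textbook definitions for simple linear regression are given here as an assumption about what the original showed; the symbol S_YX and the divisor n − 2 follow the common convention that two coefficients (b₀ and b₁) are estimated from the sample.

```latex
r^2 = \frac{SSR}{SST},
\qquad
S_{YX} = \sqrt{\frac{SSE}{n-2}}
       = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}
```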
Let’s calculate and interpret the values of measures of variation, coefficient of determination
and standard error estimate for example 12.1.
Regression Variation
$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = 41.9329$$
Error Variation
$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 33.9421$$
Total Variation
$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = SSR + SSE = 75.875$$
The total variation in the value of sales (dependent variable, Y) around its mean is 75.875. Of
this, 41.9329, or approximately 55%, is caused by variation in the
value of temperature (independent variable, X). In other words, about 55% of the variation in the value
of sales is explained by its relationship with temperature (r² = 41.9329 / 75.875 ≈ 0.55). The remaining
33.9421, or approximately 45% of the variation (error variation), is not explained by the model; it arises
from the difference between the observed values and predicted values of Y (vertical distance between
points and the line as seen in the diagram above). This unexplained variation may be attributed to
factors other than the relationship between temperature (X) and demand (Y).
These measures of variation by themselves provide little information. Their ratios are more
meaningful and provide insights on the accuracy of the regression.
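The three sums of squares and the ratios built from them can be verified numerically. This is a hedged sketch: the data are an assumption (seven pairs from the residual table, plus the inferred pair (38, 50) and sales value 48, chosen to match ΣX = 319 and ΣY = 395), and the standard error uses the common textbook definition sqrt(SSE / (n − 2)), which is assumed here because the formula is missing from this copy.

```python
from math import sqrt

# ASSUMPTION: data reconstructed from the residual table and the quoted totals.
temperature = [38, 36, 42, 41, 40, 38, 41, 43]
sales = [50, 44, 49, 48, 52, 47, 50, 55]

n = len(temperature)
mean_x = sum(temperature) / n
mean_y = sum(sales) / n

# Least-squares coefficients of the prediction line.
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(temperature, sales))
      / sum((x - mean_x) ** 2 for x in temperature))
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * x for x in temperature]

sst = sum((y - mean_y) ** 2 for y in sales)                    # total variation
ssr = sum((y_hat - mean_y) ** 2 for y_hat in predicted)        # explained variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted))  # unexplained variation

r_squared = ssr / sst              # coefficient of determination
std_error = sqrt(sse / (n - 2))    # standard error of the estimate (assumed formula)

print(round(sst, 3), round(ssr, 4), round(sse, 4), round(r_squared, 4), round(std_error, 4))
```

Under these assumptions the coefficient of determination is about 0.55, the square of r = 0.743409 reported earlier in the unit.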
Many variables in real life are related to one another in some way or the other. It is
crucial for managers to understand the nature of relationships between variables to make
important business decisions. A relationship between variables can be simple or very
complicated, depending upon the number of variables involved and how they move relative to one
another. A simple linear regression model, which involves only two variables that are related to
each other linearly, is a starting point for understanding these relationships.
While methods of studying correlation like the scatter diagram method, Karl Pearson's
coefficient, and Spearman's rank coefficient can be used to measure the direction and
strength of the relationship between two variables, a simple linear regression model is used
to predict the values of one dependent variable (Y) based on one independent variable (X).
The simple linear regression model is based on the method of least squares, which calculates
the values of intercept (b₀) and slope (b₁) that give the prediction line minimizing the sum of
squared vertical distances between the actual values and predicted values (on the line) of the
dependent variable (Y).
The slope (b₁) of the line of prediction indicates the rate of change in the dependent variable
(Y) for changes in the independent variable (X), and the intercept (b₀) indicates the predicted value
of the dependent variable (Y) when the value of the independent variable (X) is zero. Based
on these values of the regression coefficients, the line of prediction (regression model) is developed
to predict the value of Y from the values of X. Since points do not strictly lie on the prediction
line, there is some difference between the observed and predicted values of Y. This difference
is known as the residual, which estimates the error term (ε) of the regression model. As the regression
model is not generally used to extrapolate, we use interpolation to predict the value of Y
from the model.
To measure the variation caused in the value of the dependent variable (Y) due to variation in
the value of the independent variable (X), measures of variation are calculated. The three
measures of variation are SST (total sum of squares) or total variation, SSR (regression sum of
squares) or regression variation, and SSE (error sum of squares) or error variation. Since
these measures in themselves do not communicate a lot, their ratios are calculated. The ratio
of SSR to SST, called the coefficient of determination, gives the proportion of variation in the value
of Y explained by the relationship between X and Y. The remaining proportion, which is
unexplained by the relationship between X and Y, is the error part. The standard error of the
estimate, calculated from SSE and expressed in the same units as the dependent variable,
is the standard deviation of the regression model.
Independent Variable: Variable used to predict the value of the dependent variable.
Intercept: Value of Y for X = 0 in the regression equation.
Interpolation: Using minimum and maximum observed values of X to form a relevant range
for which Y will be estimated by the regression model.
Karl Pearson's Coefficient of Correlation: A unit-free term that has a magnitude and sign to
measure the strength and direction of the relationship between two variables.
Method of Least Squares: The method that minimizes the sum of squared deviations
between observed and predicted values of Y for calculating the value of the slope and
intercept of the prediction line.
Multiple Regression: The regression model used to depict the relationship between more
than two variables.
Prediction Line: The line obtained such that the sum of squared vertical distances between
observed values and predicted values of Y is minimum.
Scatter Plot: Two-dimensional diagram used to show the pattern of relationship between
two variables.
Simple Linear Regression Model: Regression model consisting of a linear relationship
between two variables.
Slope: Rate of change in Y with respect to X.
Spearman’s Rank Correlation Coefficient: A measure used to calculate correlation when the
data of variables are available as rank orders.
Standard Error of the Estimate: The standard deviation of the regression model.
Levin, Richard I., Rubin, David S., Rastogi, Sanjay, and Siddiqui, Masood Husain, 2013,
Statistics for Management, 7th edition, Pearson Education Inc., Noida.
Shrivastava, T.N, and Rego Shailaja., 2008, Statistics for Management, Tata McGraw-Hill,
New Delhi.
Gupta, S.P. and Gupta M.P., 2010, Business Statistics, 16th edition, Sultan Chand & Sons, New
Delhi.
https://economictimes.indiatimes.com/industry/cons-products/food/india-gate-indias-single-largest-ice-cream-selling-point/articleshow/5753043.cms?from=mdr
https://timesofindia.indiatimes.com/city/delhi/Delhi-slurped-ice-cream-worth-Rs-20-lakh-this-weekend-at-India-Gate/articleshow/47580940.cms
https://www.accuweather.com/en/in/delhi/202396/june-weather/202396
(b) is linear
(c) is nonsensical
(d) is multiple
Q.2 The relationship between the price of a commodity and quantity demanded is:
(a) Linear and positive
(d) Non-linear and negative
Q.3 Fitting a straight line to a set of data yields the regression model Ŷ = 5.5 + 20.05 X. What
will be the estimated value of Y for X = 15?
(a) 15
(b) 103
(c) 25.55
(d) 306.25
Q.4 For the regression model Ŷ = 5.5 + 20.05 X, the value of the residual for (X, Y) = (20, 400)
will be
(a) -6.5
(b) 6.5
(c) 406.5
(d) -406.5
Q.5 The slope of a regression line can be defined as:
(a) Change in the value of the dependent variable when the independent variable
doesn't change.
(b) Value of the dependent variable when the value of the independent variable is
zero.
(c) Change in the value of the dependent variable when the independent variable
changes by one unit.
(d) Change in the value of the dependent variable when the independent variable
changes by the same quantity.
Q.6 The least-squares method minimizes the sum of the squared differences between:
(a) Observed values and mean value of the dependent variable.
(c) Observed values of the independent variable and predicted values of the
dependent variable.
(a) Predicting the value of the dependent variable for any value of the independent
variable.
(b) Predicting the value of the dependent variable for a given range of values of the
independent variable.
(c) Predicting the value of the dependent variable for such values of the independent
variable which are beyond the range.
(d) Predicting the value of the dependent variable for a range of values of the
independent variable which are obtained only from the data used to develop a
regression model.
Q.9 Which of the following statements is incorrect about Karl Pearson's Coefficient (r) and
Spearman's Rank Coefficient (R) of correlation?
(a) The values of R and r are always the same for a given data set
(b) Values of both lie between -1 and 1
(c) Both are interpreted similarly
(d) R is based on ranks, and r is based on values of dependent and independent
variables
Q.10 The simple linear regression model is based upon:
(a) More than two variables that can be plotted as a straight line.
(d) Any number of variables that can be plotted as a straight line but at least one
variable should be categorical.
[Scatter plot; x-axis: Annual income, ticks 0 to 20; y-axis ticks 0 to 20]
(a) Direct (positive), indicating that both variables move in the same direction.
(b) The relationship between them is moderate, as the data values are neither highly
scattered nor packed together.
Q.1 Predictor variable is annual income and the response variable is the premium paid.
Q.2 False
CHECK YOUR PROGRESS - V
Intercept value b₀ = 5.912088 indicates that when annual income is zero, the premium paid
would be Rs. 5912.088. But it is highly unlikely that a person who doesn't earn will
purchase an insurance policy and pay the premium. Therefore, the value of the
intercept should be interpreted carefully.
CHECK YOUR PROGRESS - VI
The regression model shall make the best predictions for the premium amount for
annual income between Rs 5 lakhs and Rs 18 lakhs.
Q.2 False
Annual income | Premium paid | Predicted premium | Residual
8 | 18 | 12.21245421 | 5.78755
7 | 10 | 11.42490842 | -1.4249
5 | 6 | 9.84981685 | -3.8498
6 | 7 | 10.63736264 | -3.6374
13 | 15 | 16.15018315 | -1.1502
18 | 20 | 20.08791209 | -0.0879
10 | 12 | 13.78754579 | -1.7875
9 | 11 | 13 | -2
Q.12 Model standard deviation, S_YX = 3.628685 thousand rupees or Rs. 3628.685. It is
known as the standard error of the estimate.
Q5 (c)
Q6 {d)
Q7 (b)
Q8 (d)
Q9 (a)
Q.10 (b)
UNIT 13
SIMULATION
STRUCTURE
13.0 Objectives
13.1 Introduction
13.2 Definition
13.3 Applications of Simulation
13.4 Advantages and Disadvantages
13.5 Types of Simulation
13.6 Steps in Simulation
13.7 Random Numbers
13.8 Monte Carlo Simulation
13.9 Applications of Simulation
13.0 OBJECTIVES
After reading this unit, you will be able to:
• understand what simulation is and how it aids in the analysis of a problem
• identify the important role probability distributions, random numbers, and the
computer play in implementing simulation models
• realize the relative advantages and disadvantages of simulation models.
13.1 INTRODUCTION
A simulation is a computerized model that replicates the operation of a real-world system,
providing a realistic and engaging experience to the learner. Before any important event, we
perform rehearsals so that the event runs smoothly; rehearsals help to locate the pitfalls and
rectify the problems before the “real thing”. Simulation models are important in the
aviation industry, where a proposed aircraft is tested for its aerodynamic properties before
the final model is made; in disaster management, where simulation techniques are used to
create conditions similar to a natural disaster (the well-known fire drill) so that the team is
well trained and ready for rescue operations in case of a disaster; and in spacecraft design,
the training of pilots through simulators, etc. Defense forces use simulation games to prepare
soldiers for an attack and to strategize war techniques. Simulation also helps in quantifying
the relationships among complex variables that cannot be solved mathematically. To obtain
the best learning outcomes, simulation has been used even in fields like physical sciences,
engineering, statistics, finance, etc. A simulation model is prepared using assumptions
on the operation of the system. Hence, simulation models are very helpful in developing vital
skills required by an employee to perform a particular job productively.
13.2 DEFINITION
Due to its flexible methodology, simulation is applied in almost all fields. One of the greatest
strengths of simulation is its ability to answer “what-if” questions. Below are some of the real
applications of simulation:
• Medical researchers use animals to simulate the effect of a new drug before it is
introduced to human beings.
• Firefighters conduct various drills to prepare their team for the time of need.
• All commercial pilots are trained in a simulator and exposed to extreme
weather conditions before they fly an aircraft.
• The automobile industry simulates accidents to test the safety of a car when it
meets with an accident.
• Setting the stock level to meet fluctuating demand at retail stores.
• Bidding for drilling projects.
When should we Simulate?
Simulation is used when we are dealing with problems that are very complicated and
have no optimal solutions, or when it is very risky or costly to experiment with real
situations, e.g. when a person enrolls for driving or flying lessons. To teach driving, the person is
exposed to simulators that train them in the basic operations of driving, as it is very risky to take an
inexperienced driver directly on the road, where he may lose control and get embroiled in serious
accidents. For successful training, it is important that the simulation adequately imitates the real
conditions.
In general, simulation is used:
• When it is not advisable to experiment with the real system (fighter pilot).
• To study a system that deals with uncertainty.
The main advantage of simulation is that, unlike in a deterministic model, the variables are not
fixed in advance; we can vary them randomly and ascertain how the system behaves and
what happens to key decision variables when the values are changed. In this way, we
determine the range of possible outcomes with their associated probabilities, and a
sensitivity analysis is carried out to identify the best possible outcome. Simulation is especially
important when experimenting with the real situation is costly and risky.
Simulation has become a popular tool for decision-making in many areas for reasons such as
the following:
• It is easy and flexible.
Despite these advantages, simulation also has some disadvantages:
• It is based on a trial-and-error approach, which generates different output in
different runs.
• The simulation model does not provide any solution by itself; the user has to
specify the constraints for which the modeling is to be done.
• At times people do not take it seriously, or suffer fatigue, when made to practice
on a simulation model, as they know that it is just a virtual exposure.
• The abstraction is difficult for many people to understand, as the solutions are based
on virtual modeling.
In a discrete system, the changes in the system state are discontinuous, i.e. the state of a
variable changes only at a countable number of points in time. The change in the
state of the system is called an event. E.g. the number of cars serviced, the
number of complaints received, and the arrival or departure of customers in a
queue.
In other situations, the system changes smoothly with time i.e. the variable of interest can
assume either integer or non-integer values e.g. weight, height, volume, etc. Thus the
distinction between variables whether discrete or continuous is important before applying
simulation design.
Fixed Interval Versus Next Event Simulation
The other type of simulation emphasizes when an event occurs: the computer is
programmed to generate the time to the occurrence of the event. For example, in
a production line where a machine breaks down, the manager is interested in how long
the machine operated between breakdowns and how long it will take to repair the
machine.
So, depending on whether the interest lies in the occurrence of an event or in how much
time or effort is required, fixed interval or next event simulation is used.
Deterministic Versus Probabilistic Simulation
A deterministic model is the one under the same initial conditions that always
gives the same results every time we run the model. Most of the mathematical
The procedure of sampling from probability distributions is known as Monte Carlo sampling.
It is based on the probability (relative frequency) distribution of the variable. The samples
generated using this procedure should be independent. As the probabilities are calculated to
2 decimal places and add up to 1, we need 100 two-digit numbers to represent each point of
probability, so random numbers from 00 to 99 are used. Moreover, each random number from
00 to 99 has an equal probability of showing up, and it is independent of the numbers drawn
before it.
Excel has two functions for generating random numbers: RAND() and RANDBETWEEN().
Both are based on the uniform distribution.
RAND() returns a number from the continuous uniform distribution U(0,1). As the numbers
come from a formula, they are called pseudo-random.
RANDBETWEEN(low, high) generates a pseudo-random integer between low and high, where
all integers in the range are equally likely.
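For readers working outside Excel, the two functions can be imitated with Python's standard `random` module. This is only a sketch: the helper names `rand` and `randbetween` and the seed value are illustrative choices, not part of Excel or of the text above.

```python
import random

def rand():
    """Analogue of RAND(): a pseudo-random float in [0, 1)."""
    return random.random()

def randbetween(low, high):
    """Analogue of RANDBETWEEN(low, high): an integer n with low <= n <= high."""
    return random.randint(low, high)

# Fixing the seed makes the pseudo-random sequence repeatable (assumed seed value).
random.seed(13)

# Fifteen two-digit random numbers from 00 to 99, as used in Monte Carlo sampling.
two_digit_numbers = [randbetween(0, 99) for _ in range(15)]
print([f"{n:02d}" for n in two_digit_numbers])
```

Because the numbers come from a formula, re-running the script with the same seed reproduces the same sequence, which is what "pseudo-random" means in practice.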
The Monte Carlo method was developed by John von Neumann and Stanisław Ulam in the
1940s to solve complex problems using random and probabilistic methods. The
term Monte Carlo refers to the administrative area of Monaco where European elites
gamble.
There are different types of simulation, but we will focus only on probabilistic
simulations. When we work with a small set of data, the random behaviour of variables
can be mapped by drawing cards, rolling dice, flipping a coin, spinning an arrow on a
clock face, using published tables of random numbers, etc. A specific numerical value is
assigned to each possible outcome of the experiment. Though it is a very simple technique,
it is very time consuming and cannot meet practical requirements when the experiment has
a large number of outcomes, for example when decision-makers want to know the possibility
of an accident in a laboratory due to a radiation leak, the number of machine breakdowns
in a production line, or the demand for a product. As there could be a large number of
outcomes in these situations, the above-mentioned methods of randomization may not be
feasible. In such cases it is convenient to use computer-generated random variables.
Monte Carlo Simulation is a form of computer simulation: a mathematical technique that
generates random variables for modeling the risk or uncertainty of a system, using
different probabilities to predict outcomes that are difficult to observe in reality due to
the random nature of the variables. The random variables are generated from probability
distributions such as the normal, exponential, etc. Monte Carlo Simulation is used when
the model has uncertain parameters or when the system is very complex. Its roots lie
before World War II, when pilots and infantry soldiers were trained with simulators and
mock-ups to prepare for battle. Beyond warfare, simulation has spread to almost all
domains, such as finance, engineering, supply chain, physical science, computational
biology, statistics, and artificial intelligence.
Nowadays, even though there is no dearth of information, it is still very difficult to predict the
future with accuracy. In such situations, Monte Carlo Simulation comes to our rescue as we
can visualize the outcomes of the decision which further helps in optimizing better decisions
under uncertainty. As Monte Carlo Simulation uses a probability distribution function for
modeling random variables, different probability distributions generate different outcomes.
In this way, decision-makers obtain a feel about not just what to expect but the probability of
occurrence of that particular outcome. In this manner, it is possible to model the association
between random variables.
Example 13.1 To illustrate, consider a bakery that maintains a record of the sales of
multigrain bread for 2 months (i.e. 60 days).
Based on the above data we can estimate the probability distribution of demand by
converting the frequencies into probabilities. Hence, the above data can be represented as
the probability distribution table shared below:
Demand (loaves) | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Total
Probability | 0.05 | 0.18 | 0.12 | 0.10 | 0.17 | 0.20 | 0.15 | 0.03 | 1.00
Hence, from the data, we can say that there is a 5% probability that 5 loaves of multigrain
bread would be demanded on a day, an 18% probability that 6 loaves of multigrain bread
would be demanded, and so on. In this way, the above table serves as the model for the
simulation under consideration. We can simulate the model and try to capture the randomness
associated with the demand for loaves of multigrain bread. There are various ways in which
random numbers may be generated, but with computers it is usually very easy to generate
random numbers.
Steps involved in the Monte Carlo Approach
The steps involved in computing a simulation depend on the model applied. At times it can be
very complex if there are too many factors involved. But in general, it has 4 steps.
Step 1. Identify the model
A simulation is built around the quantitative model of the business plan or the process and is
defined by a series of formulas using mathematical operations that represent the
characteristics and other features of a system. Simulation can be used for anything from a
simple model that determines the profit to a model involving complex engineering formulas,
statistical models, financial models, etc. The model is then simulated to understand
how the system will behave in particular scenarios. Simulation is also applied in forecasting
the outcomes of a situation.
Step 2. Define the Input Parameters
After defining the model, it is also vital to express the equation for each factor and define the
distribution of the parameters. It is possible that the equation works on multiple
distributions: some parameters may follow a normal distribution while others may follow
a uniform distribution. In such a case, to compute the probability, we need to specify the
mean and standard deviation for parameters that follow a normal distribution.
Fixing the input parameters is as vital as building a quantitative model. Without the precise
input, a model can never generate the precise output, i.e. the desired outcomes. Compared
to the deterministic approach, a stochastic approach will provide a more reliable conclusion.
While creating a simulation in Excel, you can use either of the two functions mentioned
before to generate random numbers.
Excel also provides functions for other distributions:
Lognormal: LOGNORM.DIST, LOGNORM.INV
Binomial: BINOM.DIST, BINOM.INV
Hypergeometric: HYPGEOM.DIST
Beta: BETA.DIST, BETA.INV
Gamma: GAMMA.DIST, GAMMA.INV
Exponential: EXPON.DIST
Weibull: WEIBULL.DIST
Poisson: POISSON.DIST
Negative binomial: NEGBINOM.DIST
Unless a parameter is known to follow a specific distribution, the default distribution
used is Normal. The syntax is NORM.INV(probability, mean, standard deviation).
To randomize the results, we use the RAND function as the probability argument. NORM.INV
then returns the value at that randomly chosen percentile of a normal random variable with
the given mean and standard deviation. For example, NORM.INV(RAND(), 150, 25) generates a
value from a normal distribution with mean = 150 and standard deviation = 25.
The mean and standard deviation values should be consistent with the expected collection
of input values. For example, if you are trying to forecast next year's profits, the previous
year's sales amounts can be used as sample data. Excel has built-in functions to calculate
the mean and standard deviation.
Using the simulated data, Excel can easily calculate the outcome of the model. Most of the
variation in the parameters is captured, as the model is evaluated using a large set of
random data.
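The idea behind NORM.INV(RAND(), 150, 25) can be sketched in Python with the standard library's `statistics.NormalDist`, whose `inv_cdf` method plays the role of NORM.INV. The function name `norm_inv_rand`, the seed, and the sample size are assumptions made only for this illustration.

```python
import random
from statistics import NormalDist, mean, stdev

def norm_inv_rand(mu, sigma, rng):
    """Analogue of NORM.INV(RAND(), mu, sigma): draw a uniform percentile,
    then return the normal value at that percentile."""
    p = rng.random()
    while p == 0.0:          # inv_cdf is undefined at exactly 0, so redraw
        p = rng.random()
    return NormalDist(mu, sigma).inv_cdf(p)

rng = random.Random(2024)    # assumed seed, for a repeatable run
sample = [norm_inv_rand(150, 25, rng) for _ in range(10_000)]

# With a large sample, the simulated mean and standard deviation should be
# close to the input parameters 150 and 25.
print(round(mean(sample), 1), round(stdev(sample), 1))
```

The larger the sample, the closer the simulated mean and standard deviation come to the specified values, which is why the model is evaluated on a large set of random data.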
Real-life applications of Simulation.
Let us consider the same bakery problem and apply the steps of Monte Carlo Simulation in
Excel.
Step1:
As the probabilities have been calculated to 2 decimal places, which add up to 1, so we need
100 numbers of 2-digits to represent each point of probability. A random number between
00 to 99 is used to represent the same. In this example, as the probability of 5 loaves of bread
is 0.05, we have assigned 5 random numbers starting from 00 to 04. Similarly, each demand
level is assigned appropriate intervals of random numbers. Hence a cumulative probability is
calculated to assign numbers to correspond to the same probability range for each event.
Similarly, if probabilities are calculated to 3 decimals then 1000 random numbers are
assigned starting from 000 to 999, and so on.
Demand | Probability | Cumulative Probability | Random Number Interval
5 | 0.05 | 0.05 | 00-04
6 | 0.18 | 0.23 | 05-22
7 | 0.12 | 0.35 | 23-34
8 | 0.10 | 0.45 | 35-44
9 | 0.17 | 0.62 | 45-61
10 | 0.20 | 0.82 | 62-81
11 | 0.15 | 0.97 | 82-96
12 | 0.03 | 1.00 | 97-99
Total | 1.00
Step 2
After determining the random number intervals, we use Excel to generate the random
numbers with the function RANDBETWEEN(0, 99). Drag this formula down the cells to generate
as many random numbers as are required for the simulation. Using this formula, we generate
demand for 15 days. The numbers generated are: 12, 40, 22, 1, 28, 61, 19, 94, 87, 38, 29, 46,
72, 1, 16. Now 12 lies in the range 05-22, corresponding to a demand of 6 loaves of multigrain
bread. In the same way, the entire table is completed.
In other words, the numbers assigned to each occurrence are directly proportional to its
probability.
Day | Random Number | Demand (loaves)
1 | 12 | 6
2 | 40 | 8
3 | 22 | 6
4 | 1 | 5
5 | 28 | 7
6 | 61 | 9
7 | 19 | 6
8 | 94 | 11
9 | 87 | 11
10 | 38 | 8
11 | 29 | 7
12 | 46 | 9
13 | 73 | 10
14 | 1 | 5
15 | 16 | 6
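The lookup from random number to demand can be sketched in Python as follows. The probabilities and the 15 random numbers are those of the bakery table above; the helper name `demand_for` is an assumption made for this illustration.

```python
from bisect import bisect_right
from itertools import accumulate

demand_levels = [5, 6, 7, 8, 9, 10, 11, 12]
probabilities = [0.05, 0.18, 0.12, 0.10, 0.17, 0.20, 0.15, 0.03]

# Cumulative counts out of 100: [5, 23, 35, 45, 62, 82, 97, 100].
# A random number r belongs to the first level whose cumulative count exceeds r.
upper_bounds = list(accumulate(round(p * 100) for p in probabilities))

def demand_for(random_number):
    """Map a two-digit random number (0-99) to a demand level."""
    return demand_levels[bisect_right(upper_bounds, random_number)]

# Random numbers as listed in the simulation table (days 1 to 15).
random_numbers = [12, 40, 22, 1, 28, 61, 19, 94, 87, 38, 29, 46, 73, 1, 16]
simulated_demand = [demand_for(r) for r in random_numbers]

print(simulated_demand)
print("average daily demand:", sum(simulated_demand) / len(simulated_demand))
```

The average of the simulated demands gives the baker a first estimate of how many loaves to bake; a longer run would give a more stable estimate.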
Hence, with the help of simulation the baker can form an idea about how many loaves of
multigrain bread should he bake to satisfy the demand.
However, if the variables are not truly random and follow a normal distribution, it will lead to
erroneous results.
In inventory management, variations are observed in both the demand and the lead time. In
such a situation, simulation can help in forecasting the variables.
Example 13.2
A confectionery shop owner wants to know how, with specific re-order levels and re-order
quantities, he can optimize the total inventory cost. The details of the probability
distributions and the various costs are shared below:
Demand (units) | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
Probability | 0.05 | 0.11 | 0.12 | 0.08 | 0.18 | 0.13 | 0.09 | 0.10 | 0.11 | 0.03
The ordering cost is Rs 80 per order, the holding cost is Rs 5/unit/day, while the unit
shortage cost, i.e. loss in profits, is Rs 20/unit/day. Evaluate a simulation plan for 2 months
for a re-order quantity of 50 units and a re-order level of 20 units, with an opening inventory
balance of 50 units.
Solution:
The first step is to assign a coding system that can assign demand to the random variable. As
the probabilities are calculated to 2 decimals, the random numbers generated are between
00 to 99.
The random numbers coding for both the distributions is shown in the tables below.
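Random Number Coding for Demand
Demand | Probability | Cumulative Probability | Random Number Interval
3 | 0.05 | 0.05 | 00-04
4 | 0.11 | 0.16 | 05-15
5 | 0.12 | 0.28 | 16-27
6 | 0.08 | 0.36 | 28-35
7 | 0.18 | 0.54 | 36-53
8 | 0.13 | 0.67 | 54-66
9 | 0.09 | 0.76 | 67-75
10 | 0.10 | 0.86 | 76-85
11 | 0.11 | 0.97 | 86-96
12 | 0.03 | 1.00 | 97-99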
Let us solve the problem using Excel. Using the function RANDBETWEEN(),
random numbers are generated for demand. For instance, the random number 58 for day 1
lies in the interval 54-66, corresponding to a demand of 8 units. With the initial balance of
50 units, as shown in column (7) of the table, and a demand of 8 units on the first day, the
number of units in stock would be 42 for this day, involving a holding cost of Rs 210 (42 x
5) as shown in column (9). On day 2, the demand simulated is 3 units, leaving a balance of
39 units with a corresponding holding cost of Rs 195, and so on.
On day 5, the demand generated is 9 units, leaving a balance of 13 units. As the balance
has fallen below 20 units, an order of 50 units is placed, with the corresponding
ordering cost of Rs 80 shown in column (8). At this point a random number is generated for
the lead time using the function RANDBETWEEN(). It generates 7, which lies in the interval
00-34, corresponding to a lead time of 1 day. Hence, the stock of 50 units would be received
the next day, i.e. on day 6, as entered in column (6). We proceed in the same way and tabulate
the results below.
On day 47, the simulated demand is 6 units and the balance is 4 units. In this situation, we can
satisfy a demand of only 4 units and are left with a balance of 0 units. The 2 units of demand
that could not be met contribute a shortage cost of Rs 40 (2 x 20), shown in column (10).
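Day | R.No | Demand | R.No | L.Time | Receipts | Balance | Ordering Cost | Holding Cost | Cost of Shortage
(1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | (10)
0 | | | | | | 50 | | |
1 | 58 | 8 | | | | 42 | | 210 |
2 | 3 | 3 | | | | 39 | | 195 |
3 | 81 | 10 | | | | 29 | | 145 |
4 | 42 | 7 | | | | 22 | | 110 |
5 | 73 | 9 | 7 | 1 | | 13 | 80 | 65 |
6 | 63 | 8 | | | 50 | 55 | | 275 |
7 | 31 | 6 | | | | 49 | | 245 |
8 | 26 | 5 | | | | 44 | | 220 |
9 | 56 | 8 | | | | 36 | | 180 |
10 | 35 | 6 | | | | 30 | | 150 |
11 | 84 | 10 | | | | 20 | | 100 |
12 | 30 | 6 | 30 | 1 | | 14 | 80 | 70 |
13 | 16 | 5 | | | 50 | 59 | | 295 |
14 | 47 | 7 | | | | 52 | | 260 |
15 | 64 | 8 | | | | 44 | | 220 |
16 | 68 | 9 | | | | 35 | | 175 |
17 | 60 | 8 | | | | 27 | | 135 |
18 | 87 | 11 | 53 | 2 | | 16 | 80 | 80 |
19 | 6 | 4 | | | | 12 | | 60 |
20 | 54 | 8 | | | 50 | 54 | | 270 |
21 | 88 | 11 | | | | 43 | | 215 |
22 | 48 | 7 | | | | 36 | | 180 |
23 | 57 | 8 | | | | 28 | | 140 |
24 | 54 | 8 | | | | 20 | | 100 |
25 | 82 | 10 | 45 | 2 | | 10 | 80 | 50 |
26 | 6 | 4 | | | | 6 | | 30 |
27 | 12 | 4 | | | 50 | 52 | | 260 |
28 | 50 | 7 | | | | 45 | | 225 |
29 | 2 | 3 | | | | 42 | | 210 |
30 | 72 | 9 | | | | 33 | | 165 |
31 | 37 | 7 | | | | 26 | | 130 |
32 | 39 | 7 | 25 | 1 | | 19 | 80 | 95 |
33 | 27 | 5 | | | 50 | 64 | | 320 |
34 | 10 | 4 | | | | 60 | | 300 |
35 | 98 | 12 | | | | 48 | | 240 |
36 | 82 | 10 | | | | 38 | | 190 |
37 | 14 | 4 | | | | 34 | | 170 |
38 | 57 | 8 | | | | 26 | | 130 |
39 | 54 | 8 | 73 | 3 | | 18 | 80 | 90 |
40 | 54 | 8 | | | | 10 | | 50 |
41 | 42 | 7 | | | | 3 | | 15 |
42 | 80 | 10 | | | 50 | 43 | | 215 |
43 | 40 | 7 | | | | 36 | | 180 |
44 | 94 | 11 | | | | 25 | | 125 |
45 | 73 | 9 | 66 | 3 | | 16 | 80 | 80 |
46 | 98 | 12 | | | | 4 | | 20 |
47 | 30 | 6 | | | | 0 | | 0 | 40
48 | 68 | 9 | | | 50 | 41 | | 205 |
49 | 37 | 7 | | | | 34 | | 170 |
50 | 34 | 6 | | | | 28 | | 140 |
51 | 16 | 5 | | | | 23 | | 115 |
52 | 12 | 4 | 2 | 1 | | 19 | 80 | 95 |
53 | 83 | 10 | | | 50 | 59 | | 295 |
54 | 25 | 5 | | | | 54 | | 270 |
55 | 80 | 10 | | | | 44 | | 220 |
56 | 18 | 5 | | | | 39 | | 195 |
57 | 34 | 6 | | | | 33 | | 165 |
58 | 78 | 10 | | | | 23 | | 115 |
59 | 92 | 11 | 28 | 1 | | 12 | 80 | 60 |
60 | 63 | 8 | | | 50 | 54 | | 270 |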
Completing 60 days, we find that total ordering cost= Rs 720, the holding cost = Rs 9700, and
out-of-stock cost = Rs 40, adding up to Rs 10,460
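The 60-day run can be reproduced with a short Python sketch. Two assumptions are made here: the lead time of each order is read off the day on which its receipt appears in the table (1, 1, 2, 2, 1, 3, 3, 1, 1 days), because the lead-time distribution itself is not reproduced, and the illegible random number for day 60 is taken as 63, which is consistent with the demand of 8 shown for that day. The function and variable names are illustrative.

```python
from bisect import bisect_right
from itertools import accumulate

DEMAND_LEVELS = list(range(3, 13))
DEMAND_PROBS = [0.05, 0.11, 0.12, 0.08, 0.18, 0.13, 0.09, 0.10, 0.11, 0.03]
UPPER_BOUNDS = list(accumulate(round(p * 100) for p in DEMAND_PROBS))

def demand_for(random_number):
    """Map a random number (0-99) to a demand level via the coding table."""
    return DEMAND_LEVELS[bisect_right(UPPER_BOUNDS, random_number)]

def simulate(demand_rns, lead_times, opening=50, reorder_level=20,
             order_qty=50, order_cost=80, holding_cost=5, shortage_cost=20):
    """Return (ordering, holding, shortage) cost totals for the run."""
    balance, due_day = opening, None
    leads = iter(lead_times)
    ordering = holding = shortage = 0
    for day, rn in enumerate(demand_rns, start=1):
        if due_day == day:                    # outstanding order arrives
            balance += order_qty
            due_day = None
        demand = demand_for(rn)
        sold = min(demand, balance)
        shortage += (demand - sold) * shortage_cost
        balance -= sold
        holding += balance * holding_cost     # charged on the closing balance
        if balance < reorder_level and due_day is None:
            ordering += order_cost            # place one order at a time
            due_day = day + next(leads)
    return ordering, holding, shortage

demand_rns = [58, 3, 81, 42, 73, 63, 31, 26, 56, 35, 84, 30, 16, 47, 64,
              68, 60, 87, 6, 54, 88, 48, 57, 54, 82, 6, 12, 50, 2, 72,
              37, 39, 27, 10, 98, 82, 14, 57, 54, 54, 42, 80, 40, 94, 73,
              98, 30, 68, 37, 34, 16, 12, 83, 25, 80, 18, 34, 78, 92, 63]
lead_times = [1, 1, 2, 2, 1, 3, 3, 1, 1]      # assumed: read from the receipts column

ordering, holding, shortage = simulate(demand_rns, lead_times)
print(ordering, holding, shortage, ordering + holding + shortage)
```

Changing `reorder_level` or `order_qty` in the call explores other policies, but a different policy places orders on different days, so it needs its own sequence of lead times.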
Executing the same operations with only the re-order level changed to 15 units and the
remaining variables held constant (not shown here; the student is expected to compute this
independently), we obtain a total cost of Rs 9,675, comprising total ordering cost = Rs 720,
holding cost = Rs 8,735, and out-of-stock cost = Rs 220.
Similarly, for a re-order quantity of 30 units, keeping other variables constant, we obtain a
total cost of Rs 7,700, comprising total ordering cost = Rs 1,120, holding cost = Rs 6,460,
and out-of-stock cost = Rs 120.
We can thus optimize the total cost by changing parameters.
For reference, we share the Formulas used in EXCEL to solve the above problem.
Assumptions: The population of arrivals is infinite, waiting capacity is unlimited, customers
are served in the order of their arrival (FCFS), arrivals are random, and service times are
random.
Example 13.3
In a large bank, the manager is concerned about the waiting time of customers. He is in a
dilemma about whether to hire more staff to raise the level of service, as this will also lead
to an increase in the idle time of staff. He wants to determine how many staff should be hired
to minimize the total cost involved. He has shared the data of the times between successive
arrivals and the service times for the past 200 observations. He requires help in optimizing
the cost.
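Distribution of inter-arrival time
Time between arrivals (minutes) | Frequency
0 | 12
3 | 18
6 | 50
9 | 74
12 | 32
15 | 14
Total | 200
Distribution of service time
Service time (minutes) | Frequency
4 | 8
6 | 20
8 | 36
10 | 88
12 | 48
Total | 200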
Solution:
Just like the previous example, the first step is to estimate the probability and cumulative
probability. Then based on cumulative probability random numbers are assigned to each
observed arrival time. Similarly, random numbers are assigned to service times also.
Random Number Coding for Inter-arrival times.
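Time between arrivals (minutes) | Frequency | Probability | Cumulative Probability | Random Number Interval
3 | 12 | 0.06 | 0.06 | 00-05
6 | 18 | 0.09 | 0.15 | 06-14
9 | 50 | 0.25 | 0.40 | 15-39
12 | 74 | 0.37 | 0.77 | 40-76
15 | 32 | 0.16 | 0.93 | 77-92
18 | 14 | 0.07 | 1.00 | 93-99
Total | 200 | 1.00
Random Number Coding for Service times.
Service time (minutes) | Frequency | Probability | Cumulative Probability | Random Number Interval
4 | 8 | 0.04 | 0.04 | 00-03
6 | 20 | 0.10 | 0.14 | 04-13
8 | 36 | 0.18 | 0.32 | 14-31
10 | 88 | 0.44 | 0.76 | 32-75
12 | 48 | 0.24 | 1.00 | 76-99
Total | 200 | 1.00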
Now we are ready to simulate the operation. To determine the arrival times, we start with 5
major columns and for service time we start with 4 major columns. We shall simulate the
bank problem for 30 days.
The first random number for arrival is 48, which lies in the interval 40-76, corresponding
to 12 minutes. Assuming that the bank opens at 9:00 AM, the first arrival in the bank takes
place at 9:12 AM. Further, the random number for service time is 40, which lies in the interval
32-75, corresponding to 10 minutes. Thus, the service starts at 9:12 and ends at 9:22.
There would be no waiting time for this customer. The second random number for arrival is
49, which lies in the interval 40-76, corresponding to 12 minutes. So the second arrival would
be at 9:24 AM. The second random number for service time is 57, which lies in the interval
32-75, corresponding to 10 minutes. As the first customer would leave the bank by 9:22, the
waiting time for the second customer is also 0. As there is no customer before him, he is the
only person standing in the queue, so the queue length is 1. We complete the table in the same
way, generating random numbers for Day 1.
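The bookkeeping for a single-counter, first-come-first-served queue can be sketched in Python. The sketch assumes the inter-arrival times 3 to 18 minutes used in the worked text (48 and 49 both map to 12 minutes), and it runs only the two customers described above, since the remaining random numbers are not listed. Times are minutes after 9:00 AM; all names are illustrative.

```python
from bisect import bisect_right

def make_lookup(values, frequencies):
    """Build a random-number (0-99) lookup from a frequency table."""
    total = sum(frequencies)
    bounds, running = [], 0
    for f in frequencies:
        running += f
        bounds.append(running * 100 // total)   # cumulative count out of 100
    return lambda rn: values[bisect_right(bounds, rn)]

inter_arrival = make_lookup([3, 6, 9, 12, 15, 18], [12, 18, 50, 74, 32, 14])
service_time = make_lookup([4, 6, 8, 10, 12], [8, 20, 36, 88, 48])

def simulate(arrival_rns, service_rns):
    """Single server, FCFS. Returns one record per customer."""
    rows, arrival, server_free = [], 0, 0
    for a_rn, s_rn in zip(arrival_rns, service_rns):
        arrival += inter_arrival(a_rn)           # clock time of this arrival
        start = max(arrival, server_free)        # wait if the counter is busy
        end = start + service_time(s_rn)
        rows.append({"arrival": arrival, "start": start,
                     "end": end, "wait": start - arrival})
        server_free = end
    return rows

rows = simulate([48, 49], [40, 57])
for row in rows:
    print(row)
print("average wait:", sum(r["wait"] for r in rows) / len(rows))
```

Extending the two input lists to the full set of random numbers gives the average waiting time for the whole run; adding a second counter would need one `server_free` value per counter.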
The average waiting time is 77 ÷ 30 = 2.57 minutes and the average queue length is 46 ÷ 30 =
1.53. In this way, by running this simulation multiple times, the manager can decide on the
average waiting time and average queue length in his bank. Using this information, he can
examine the alternatives by adding multiple counters in the bank and determine the
outcome based on new service patterns and opt for the best alternative.
Example 13.4
A company manufactures 32 units per day. The sale of these items depends upon demand,
which has the following distribution.
Sales (units) | Probability
30 | 0.35
31 | 0.15
32 | 0.05
33 | 0.10
34 | 0.15
35 | 0.20
The production cost and sale price of each unit are Rs 60 and Rs 80 respectively. Any unsold
product is to be disposed of at Rs 40 per unit. There is a penalty of Rs 5 per unit if the demand
is not met. Using the following random numbers, estimate the total profit/loss for the
company for the next 10 days: 1, 9, 17, 99, 20, 85, 77, 63, 13, 38. Will it be advantageous to
produce 30 units per day?
Solution:
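Sales (units) | Probability | Cumulative Probability | Random Number Interval
30 | 0.35 | 0.35 | 00-34
31 | 0.15 | 0.50 | 35-49
32 | 0.05 | 0.55 | 50-54
33 | 0.10 | 0.65 | 55-64
34 | 0.15 | 0.80 | 65-79
35 | 0.20 | 1.00 | 80-99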
Profit= Rs 80 - Rs 60 = Rs 20/unit
Loss = Rs 40/unit
Penalty on stock-out = Rs 5/unit
The total profit for 10 days is Rs 6,055 when 32 units are produced, and if the company
produces 30 units, then the total profit is Rs 5,875. Hence, the company should continue to
produce 32 units.
Example 13.5:
The toy factory produces robots that pass through two assembly lines to become the final
product. The processing time for each assembly line is regarded as a random variable and is
described by the following distributions.
Process Time (mins) | Probability (Assembly 1) | Probability (Assembly 2)
5 | 0.10 | 0.35
6 | 0.15 | 0.30
7 | 0.20 | 0.20
8 | 0.25 | 0.10
9 | 0.18 | 0.03
10 | 0.12 | 0.02
Using the following random numbers, find the expected process time for the period.
R. No for assembly 1: 34, 43, 2, 5, 28, 76, 33, 45, 89, 24, 43, 15, 90, 80, 9
R. No for assembly 2: 21, 83, 36, 75, 74, 11, 94, 34, 19, 8, 91, 44, 12, 65, 54
Solution:
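Process Time (mins) | Assembly 1 Prob. | Cum. Prob. | R.N. Interval | Assembly 2 Prob. | Cum. Prob. | R.N. Interval
5 | 0.10 | 0.10 | 00-09 | 0.35 | 0.35 | 00-34
6 | 0.15 | 0.25 | 10-24 | 0.30 | 0.65 | 35-64
7 | 0.20 | 0.45 | 25-44 | 0.20 | 0.85 | 65-84
8 | 0.25 | 0.70 | 45-69 | 0.10 | 0.95 | 85-94
9 | 0.18 | 0.88 | 70-87 | 0.03 | 0.98 | 95-97
10 | 0.12 | 1.00 | 88-99 | 0.02 | 1.00 | 98-99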
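No. | R. No (Assembly 1) | Time (mins) | R. No (Assembly 2) | Time (mins) | Total Time (mins)
1 | 34 | 7 | 21 | 5 | 12
2 | 43 | 7 | 83 | 7 | 14
3 | 2 | 5 | 36 | 6 | 11
4 | 5 | 5 | 75 | 7 | 12
5 | 28 | 7 | 74 | 7 | 14
6 | 76 | 9 | 11 | 5 | 14
7 | 33 | 7 | 94 | 8 | 15
8 | 45 | 8 | 34 | 5 | 13
9 | 89 | 10 | 19 | 5 | 15
10 | 24 | 6 | 8 | 5 | 11
11 | 43 | 7 | 91 | 8 | 15
12 | 15 | 6 | 44 | 6 | 12
13 | 90 | 10 | 12 | 5 | 15
14 | 80 | 9 | 65 | 7 | 16
15 | 9 | 5 | 54 | 6 | 11
Total | | | | | 200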
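The table can be checked with a short Python sketch that maps each pair of random numbers to process times and averages the totals. The distributions and random numbers are those given in the example; the helper names are assumptions made for this illustration.

```python
from bisect import bisect_right
from itertools import accumulate

TIMES = [5, 6, 7, 8, 9, 10]
PROB_1 = [0.10, 0.15, 0.20, 0.25, 0.18, 0.12]   # assembly line 1
PROB_2 = [0.35, 0.30, 0.20, 0.10, 0.03, 0.02]   # assembly line 2

def make_lookup(probabilities):
    """Return a function mapping a random number (0-99) to a process time."""
    bounds = list(accumulate(round(p * 100) for p in probabilities))
    return lambda rn: TIMES[bisect_right(bounds, rn)]

time_1, time_2 = make_lookup(PROB_1), make_lookup(PROB_2)

rn_1 = [34, 43, 2, 5, 28, 76, 33, 45, 89, 24, 43, 15, 90, 80, 9]
rn_2 = [21, 83, 36, 75, 74, 11, 94, 34, 19, 8, 91, 44, 12, 65, 54]

# Total process time per robot = time on line 1 + time on line 2.
totals = [time_1(a) + time_2(b) for a, b in zip(rn_1, rn_2)]

print(totals)
print("total:", sum(totals), "expected process time:", round(sum(totals) / len(totals), 2))
```

On these 15 runs the total is 200 minutes, so the expected process time is 200 ÷ 15, about 13.33 minutes per robot.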
INDEX NUMBER
STRUCTURE
14.0 Objectives
14.1 Introduction
14.2 Definitions
14.3 Types of Index Number
14.4 Methods of constructing Index Number
14.5 Unweighted Index Number
14.6 Weighted Index Number
14.7 Test to verify the consistency of Index Number
14.8 Index Number Used in India
14.9 Let Us Sum Up
14.10 Key Words
14.11 Self-Assessment Questions
14.0 OBJECTIVES
After reading this unit, you will be able to:
• compute index numbers to measure price and quantity changes and
interpret them
14.1 INTRODUCTION
The price of a commodity, the volume of imports and exports, the quantity of agricultural
production, unemployment, etc. change with time. The changes are neither constant nor
follow a pattern. Some increase with time e.g. population, prices of commodities, etc. while
some of the variables decrease with time e.g. death rate, purchasing power, etc. It is
important to study these changes to plan for the future. “Indexing” is a technique that
measures changes in a variable or a family of the variables with time, location, or other
characteristics. It is one of the most widely used statistical methods, yet a simple and
effective tool. For example, a pharmaceutical firm interested in manufacturing a drug
for cervical cancer may want to find out whether the number of cases of cervical cancer
reported this year has increased or decreased, and to what extent, compared to the previous
year; a housewife computing her monthly budget may want to know the change
in the price of LPG and essential items over the past year; a company may be interested in
changes in prices of raw materials, wages, advertising costs, share prices, profits, etc. For
arriving at decisions, one may be interested to know how much the price of a good has
changed over time.
14.2 DEFINITIONS
The index number is a relative measure used to compare and describe the average change in
prices, quantities, or values of an item or a group of related items over some time.
The ratio is multiplied by 100 and expressed as a percentage. As the index number is
estimated as a ratio of one time period over another, it has no unit.
$$\text{Index number} = \frac{\text{current period value}}{\text{base period value}} \times 100$$
Different authors and institutions across the world have provided different definitions. A few
selected definitions are shared below:
According to Tuttle: “Index number is a single ratio (or a percentage) which measures the
combined change of several variables between two different times, places or situations.”
In the words of Maslow “An index number is a numerical value characterizing the change in
the complex economic phenomenon over some time or space.”
Spiegel defines, “An index number is a statistical measure designed to show changes in a
variable or a group of related variables concerning time, geographical location or other
characteristics.”
According to Croxton and Cowden “Index numbers are devices for measuring differences in
the magnitude of a group of related variables.”
Bowley describes “Index Numbers” as “a series which reflects in its trend and fluctuations
the movements of some quantity.”
According to Wheldon, “An index number is a device which shows by its variation the changes
in a magnitude which is not capable of accurate measurement in itself or direct valuation in
practice.”
Edgeworth gave the classical definition of index numbers as follows: “Index number shows by
its variations the changes in a magnitude which is not susceptible either of accurate
measurement in itself or of direct valuation in practice.”
In the words of Lawrence J. Kaplan, “An index number is a statistical measure of fluctuations
in a variable arranged in the form of a series, and using a base period for making
comparisons.”
Reading the above definition of index numbers, we can see that index numbers are defined
in three categories either as a measure of change, or a device to measure change, or a series
representing the process of change.
Index numbers are broadly classified into three categories (i) price indexes (ii) quantity
indexes (iii) value indexes
The price index is one of the most prevalent indexes. It is a special type of average
which measures the levels of price from one period to another. In estimating Price
indexes comparisons are made to prices. E.g. wholesale price index number, retail
price index numbers, consumer price index number, etc.
Year (1) | Price of wheat per quintal (2) | Ratio (3) = (2) ÷ 650 | Percentage Relative (4) = (3) x 100
2006 | 650 | 1 | 100
From example 14.1, it is observed that compared to the base year 2010 the price
relative of 115.3846 in 2006 shows an increase of 15.38% in the price of wheat, price
Example 14.2: calculation of quantity index number (the base year 2010)
From example 14.2, it is observed that compared to the base year 2010 the price
relative of 109.7674 in 2010 shows an increase of 9.76%, and so on.
Value index numbers are used to study the change in the total value of a certain period
with the total value of the base period. For example, the turnover of a company in
2020 compared to 2008.
From the above table, we can say that the turnover of 2010 increased 48% price
relative to the turnover of 2015.
The index number estimates the relative changes in an item or group of items
Index numbers can be used as one of the forecasting techniques, index numbers
estimate trends that help in making conclusions in cyclical and irregular
components.
Limitations/Precautions while constructing index number
As indices are estimated from sample data, the results are to be looked cautiously
as there are chances of committing errors.
Proper selection of the base period is very important, the conclusions changes
• There are several methods to estimate the index number. Hence the selection of
appropriate weights and of an appropriate formula is very important; any
mistake here would lead to a wrong interpretation.
Unweighted | Weighted
$$Q_{01} = \frac{\sum q_1}{\sum q_0} \times 100$$
$$V_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0} \times 100$$
Example 14.4:
Construct the Price Index Number for the year 2019, from the following information taking
2018 as the base year.
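Commodity | Price in 2018 | Price in 2019
Sugar | 32 | 34
Wheat | 25 | 30
Solution:
Construction of Price Index:
Commodity | Price in 2018 ($p_0$) | Price in 2019 ($p_1$)
Sugar | 32 | 34
Wheat | 25 | 30
$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$$
= 1.2362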
Example 14.5
Calculate Price Index Number for 2018 from the following data by simple aggregate method,
taking 2019 as the base year.
Commodity | 2018 | 2019
Potato | 15 | 30
Tomato | 50 | 80
Onion | 30 | 70
Lemon | 80 | 100
Solution:
Commodity | 2018 | 2019
Potato | 15 | 30
Tomato | 50 | 80
Onion | 30 | 70
Lemon | 80 | 100
Price Index:
$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$$
= (560 ÷ 395) x 100
$P_{01}$ = 141.77%
Compared with the base year, the price index has increased by 41.77%.
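$$P_{01} = \frac{\sum \left( \frac{p_1}{p_0} \times 100 \right)}{N}$$
Average Price Index Number Using Geometric Mean is estimated using the following
formula:
$$P_{01} = \text{antilog} \left[ \frac{\sum \log \left( \frac{p_1}{p_0} \times 100 \right)}{N} \right]$$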
Limitations
1. Every item in the index number is given equal weight, i.e. equal importance. In
actual practice this is not true, as some price relatives are more important than
others.
2. Out of both the methods arithmetic mean is preferred over geometric mean to
calculate the average price relatives.
Example 14.6
Compute the price index number by the simple average of price relatives method, using the
arithmetic mean and the geometric mean.
Potato 15 30
Tomato 50 80
Onion 30 70
Lemon 80 100
Solutions:
Calculation of price index number by a simple average of price relatives:
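$$P_{01} = \frac{\sum \left( \frac{p_1}{p_0} \times 100 \right)}{N}$$
$P_{01}$ = 971.67 ÷ 6 = 161.94
(ii) Price relative index number based on geometric mean:
$$P_{01} = \text{antilog} \left[ \frac{\sum \log \left( \frac{p_1}{p_0} \times 100 \right)}{N} \right]$$
= antilog (13.17416 ÷ 6)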
Hence, the price index number based on the arithmetic mean and the geometric mean for
the year 2002 are 161.94 and 156.9 respectively.
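Both averages of price relatives can be sketched in Python. Only the four commodities that appear in the table above are used here, so the results differ from the six-item figures of 161.94 and 156.9 quoted in the solution; the last line checks the antilog step of the solution directly. The function names are illustrative.

```python
import math

def price_relatives(base_prices, current_prices):
    """Price relative of each item: (p1 / p0) x 100."""
    return [p1 / p0 * 100 for p0, p1 in zip(base_prices, current_prices)]

def arithmetic_mean_index(relatives):
    return sum(relatives) / len(relatives)

def geometric_mean_index(relatives):
    # antilog of the average of the base-10 logs of the price relatives
    return 10 ** (sum(math.log10(r) for r in relatives) / len(relatives))

# Potato, Tomato, Onion, Lemon (the rows visible in the table).
base_prices = [15, 50, 30, 80]
current_prices = [30, 80, 70, 100]

relatives = price_relatives(base_prices, current_prices)
am_index = arithmetic_mean_index(relatives)
gm_index = geometric_mean_index(relatives)
print(round(am_index, 2), round(gm_index, 2))

# The antilog step from the worked solution: antilog(13.17416 / 6).
book_gm = 10 ** (13.17416 / 6)
print(round(book_gm, 1))
```

The geometric mean is never larger than the arithmetic mean of the same price relatives, which is why the geometric-mean index comes out lower in both the sketch and the worked solution.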
Product A B Cc D E
14.5 Construct simple average price relative index number using the arithmetic mean for
the year 2012 for the following data showing the profit from various categories sold out in
departmental stores.
14.6 Construct simple average price relative index number using the geometric mean for
the year 2015 for the data showing the expenditure in the holiday destination of the family of
three.
Expenditure per week for a family of three 2014 2015
weights are assigned to different items. The common method of assigning weights is either
by quantities consumed or by value sold.
In a weighted aggregate price index, each item in the basket of items chosen for estimating
the index is assigned a weight either by corresponding quantities produced or consumed or
sold to show their importance either in the base year or in the current year. This is because
the consumption quantities of the group of customers will be more for some items and less
for others. Hence, it is also required to obtain a measure of quantity used for the various
items in the group. Estimating the index for the quantity is thus a better estimate than just
estimating the changes in price over time. It helps to improve the accuracy of the price level
estimate. It is useful to monitor changes in price levels over different periods. Inflation
reduces the purchasing power of the individual, hence it is very important to split real
income from nominal income. As there are various ways of assigning weights, there are
many methods available for constructing index numbers. A few important approaches to
determine weights are described below:
a) Laspeyres’ Index
b) Paasche’s Index
c) Dorbish and Bowley’s Index
d) Fisher’s Ideal Index
e) Marshall-Edgeworth Index
f) Kelly’s Index
g) Walsh’s Index
a) Laspeyres’ method
This method was developed by the German economist Étienne Laspeyres in 1871. He proposed
using base year quantities as weights. Hence, it is also called the base year quantity
weighted method. The formula for estimating Laspeyres’ Price Index is:
$$P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$$
basket of goods in the current period as he did in the base period. This helps to answer the
questions about, how much the income has to be increased to compensate for inflation.
Advantage:
The main advantage of this method is that as the index uses the same base price and
quantity, it becomes easier to compare the index of one period with another.
Disadvantage:
The consumption of commodities increases and decreases with fluctuation in prices, so
keeping the quantity fixed may not be realistic. It assumes that the consumer is consuming
the same basket of goods as before. However, it has been found that with an increase in
prices in some goods, the quantity consumed decreases and consumption shifts to goods at
lower prices.
The following curve generally holds good.
[Figure: price ($) plotted against quantity (units)]
Example 14.7: Compute the cost of the living index using Laspeyres’ method from the
following information.
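Commodities | Quantity consumed in 2018 | Price in 2018 (Rs) | Price in 2019 (Rs)
Sugar | 100 | 32 | 34
Wheat | 200 | 25 | 30
Solution:
Commodities | Quantity consumed in 2018 ($q_0$) | Price in 2018 (Rs) ($p_0$) | Price in 2019 (Rs) ($p_1$) | $p_1 q_0$ | $p_0 q_0$
$$P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{1,11,200}{89,200} \times 100 = 124.6637$$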
Advantages
The main advantage of Paasche’s method is that it focuses on changes in both price and quantity,
hence it provides a better estimate of changes in the index compared to the Laspeyres
method. If the quantity consumed in the base year is the same as the quantity consumed in
the current year, then Laspeyres’ and Paasche’s index numbers give the same answer.
Disadvantage
As this method is computed using the quantities consumed in the current year, obtaining the
data is time-consuming and expensive. Moreover, unlike Laspeyres’ Index
Number, it is difficult to compare indexes of different periods, since the index of
each year has to be recomputed with that year’s quantities.
Example 14.8: For the following data, calculate the price index number of 2019 with 2018 as
base year using Laspeyres’ method and Paasche’s method.
Items | Quantity consumed in 2018 | Quantity consumed in 2019 | Price in 2018 | Price in 2019
A | 20 | 30 | 10 | 15
B | 30 | 20 | 12 | 10
C | 40 | 60 | 14 | 20
D | 50 | 50 | 20 | 25
E | 60 | 40 | 25 | 22
Solution:
Laspeyres’ Method
Items | $q_0$ | $q_1$ | $p_0$ | $p_1$ | $p_1 q_0$ | $p_0 q_0$
A | 20 | 30 | 10 | 15 | 300 | 200
B | 30 | 20 | 12 | 10 | 300 | 360
C | 40 | 60 | 14 | 20 | 800 | 560
D | 50 | 50 | 20 | 25 | 1,250 | 1,000
E | 60 | 40 | 25 | 22 | 1,320 | 1,500
Total | | | | | 3,970 | 3,620
Laspeyres’ Price Index = (3,970 ÷ 3,620) × 100 = 109.6685
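Paasche’s Method
Items | $q_0$ | $q_1$ | $p_0$ | $p_1$ | $p_1 q_1$ | $p_0 q_1$
A | 20 | 30 | 10 | 15 | 450 | 300
B | 30 | 20 | 12 | 10 | 200 | 240
C | 40 | 60 | 14 | 20 | 1,200 | 840
D | 50 | 50 | 20 | 25 | 1,250 | 1,000
E | 60 | 40 | 25 | 22 | 880 | 1,000
Total | | | | | 3,980 | 3,380
Paasche’s Price Index = (3,980 ÷ 3,380) × 100 = 117.75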
Laspeyres' price index shows a price level increase of 9.67%, whereas Paasche's price index
shows a price level increase of 17.75%.
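The two calculations in Example 14.8 can be checked with a short script. This is a minimal sketch, not part of the original text: the function names `laspeyres` and `paasche` and the tuple layout `(item, q0, q1, p0, p1)` are assumptions chosen for illustration.

```python
# Data from Example 14.8, one tuple per item: (item, q0, q1, p0, p1)
rows = [
    ("A", 20, 30, 10, 15),
    ("B", 30, 20, 12, 10),
    ("C", 40, 60, 14, 20),
    ("D", 50, 50, 20, 25),
    ("E", 60, 40, 25, 22),
]

def laspeyres(rows):
    """Price index weighted by base-year quantities q0."""
    num = sum(p1 * q0 for _, q0, _, _, p1 in rows)
    den = sum(p0 * q0 for _, q0, _, p0, _ in rows)
    return num / den * 100

def paasche(rows):
    """Price index weighted by current-year quantities q1."""
    num = sum(p1 * q1 for _, _, q1, _, p1 in rows)
    den = sum(p0 * q1 for _, _, q1, p0, _ in rows)
    return num / den * 100

print(round(laspeyres(rows), 2))  # 109.67
print(round(paasche(rows), 2))    # 117.75
```

The only difference between the two functions is which quantity column serves as the weight, which is exactly the distinction the two methods draw.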
c) Dorbish and Bowley's method
Laspeyres' method is based on the impact of quantities of the base year, and Paasche's
method is based on the impact of quantities of the current year. To capture the influence of
both the periods i.e. base period and current period, Dorbish and Bowley in 1901 suggested
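to take the arithmetic mean of the two indexes. The formula for this index is:
$$P_{01} = \frac{1}{2}\left(\frac{\sum p_1 q_0}{\sum p_0 q_0} + \frac{\sum p_1 q_1}{\sum p_0 q_1}\right) \times 100$$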
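d) Fisher’s ideal method
In 1920, Irving Fisher proposed to estimate the index number using the geometric mean of
Laspeyres’ Index and Paasche’s Index. It is also called Fisher’s Ideal Index Number. The
formula is given by:
$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$$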
Advantages
• The geometric mean is the best tool for constructing index numbers.
• As the quantities are used as weights for both the current and base period, it
avoids the biases associated with both Laspeyres’ and Paasche’s index number.
• This method also satisfies the two important tests required for the index, i.e. the
time-reversal test and factor-reversal test.
Disadvantage
Although the index is theoretically better, it is not used often because it requires a lot of
computation.
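Example 14.9: For the following data, calculate the price index number of 2019 with 2018 as
base year using Fisher’s ideal method.
Items | Quantity consumed in 2018 | Quantity consumed in 2019 | Price in 2018 | Price in 2019
A | 20 | 30 | 10 | 15
B | 30 | 20 | 12 | 10
C | 40 | 60 | 14 | 20
D | 50 | 50 | 20 | 25
E | 60 | 40 | 25 | 22
Solution:
$$P_{01} = \sqrt{\frac{3{,}970}{3{,}620} \times \frac{3{,}980}{3{,}380}} \times 100 = 113.64$$
Hence, we conclude that in the year 2019 the price index has increased by 13.64%.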
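As a check on the 13.64% figure, the sketch below combines the two aggregate ratios in both ways discussed above: the arithmetic mean (Dorbish and Bowley) and the geometric mean (Fisher). The variable names are illustrative assumptions; the four sums are the column totals of the five-item table (A to E) used in Example 14.8.

```python
import math

# Column totals for items A to E (base-year and current-year weights)
sum_p1q0, sum_p0q0 = 3970, 3620   # Laspeyres numerator and denominator
sum_p1q1, sum_p0q1 = 3980, 3380   # Paasche numerator and denominator

laspeyres = sum_p1q0 / sum_p0q0 * 100
paasche = sum_p1q1 / sum_p0q1 * 100

# Dorbish and Bowley: arithmetic mean of the two indexes
dorbish_bowley = (laspeyres + paasche) / 2

# Fisher: geometric mean of the two indexes
fisher = math.sqrt(laspeyres * paasche)

print(round(dorbish_bowley, 2))  # 113.71
print(round(fisher, 2))          # 113.64
```

Because the geometric mean of two unequal positive numbers is always below their arithmetic mean, Fisher's value is slightly lower than the Dorbish and Bowley value.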
Example 14.10: For the following data, calculate the price index number of 2009 with 2008
as base year using Fisher’s ideal method.
Items | Expenditure in base year | Expenditure in current year | Price in base year | Price in current year
A | 200 | 300 | 10 | 15
B | 300 | 200 | 12 | 10
C | 420 | 600 | 15 | 20
D | 500 | 500 | 20 | 25
E | 600 | 400 | 25 | 20
Solution:
This example differs from the previous one: the data give expenditure instead of
quantities, so the quantities must first be calculated from expenditure and price.
Quantity consumed = expenditure ÷ price.
$$\text{Fisher's Ideal Price Index} = \sqrt{\frac{2{,}215}{2{,}020} \times \frac{2{,}000}{1{,}790}} \times 100 = 110.688$$
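The conversion from expenditure to quantity in Example 14.10 can be sketched as follows. The column order (base-year expenditure, current-year expenditure, base-year price, current-year price) is an assumption inferred from the totals 2,215, 2,020, 2,000 and 1,790 shown in the solution; the variable names are illustrative.

```python
import math

# (item, expenditure_base, expenditure_current, p0, p1); column order is assumed
rows = [
    ("A", 200, 300, 10, 15),
    ("B", 300, 200, 12, 10),
    ("C", 420, 600, 15, 20),
    ("D", 500, 500, 20, 25),
    ("E", 600, 400, 25, 20),
]

sum_p1q0 = sum_p0q0 = sum_p1q1 = sum_p0q1 = 0.0
for _, e0, e1, p0, p1 in rows:
    q0 = e0 / p0          # quantity = expenditure / price (base year)
    q1 = e1 / p1          # quantity = expenditure / price (current year)
    sum_p1q0 += p1 * q0
    sum_p0q0 += p0 * q0
    sum_p1q1 += p1 * q1
    sum_p0q1 += p0 * q1

fisher = math.sqrt((sum_p1q0 / sum_p0q0) * (sum_p1q1 / sum_p0q1)) * 100
print(sum_p1q0, sum_p0q0, sum_p1q1, sum_p0q1)  # 2215.0 2020.0 2000.0 1790.0
print(round(fisher, 2))                        # 110.69
```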
e) Marshall-Edgeworth method
$$P_{01} = \frac{\sum p_1 (q_0 + q_1)}{\sum p_0 (q_0 + q_1)} \times 100$$
This method takes the average of the quantities consumed in the two periods (the division by 2
cancels out). It thus takes into account both the quantities altered by the change in prices and
the original quantities, since the consumer may have wished to consume as much in the current
year as in the previous year but was unable to do so because of price inflation. It gives due
weight to this aspect.
f) Kelly’s method
This method was proposed by Truman L. Kelly, who suggested a fixed-weight approach to
estimating the index number:
$$P_{01} = \frac{\sum p_1 q}{\sum p_0 q} \times 100$$
Here q is a quantity that need not refer to the base or current year. If q is the
average quantity of two years then $q = (q_0 + q_1) \div 2$. Similarly, the average of 3 or more years can
also be used as weights. The logic is the same as Marshall-Edgeworth.
Advantages:
The main advantage of this method is that it does not require the information regarding the
yearly changes in the weights. A basket for various years’ consumption can be a better
indicator. This can improve the accuracy of the index number. The only point to remember is
that the weight should be appropriate and indicate the relative importance of various
commodities.
Disadvantage
The index does not take either the base year or current year as a fixed weight.
g) Walsch’s Method
Correa Walsch proposed the formula in 1901. He proposed that the weight be used as the
geometric mean of the base year and current year quantities. The formula is given below:
$$P_{01} = \frac{\sum p_1 \sqrt{q_0 q_1}}{\sum p_0 \sqrt{q_0 q_1}} \times 100$$
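1. Laspeyres’ method
2. Paasche’s method
3. Dorbish and Bowley’s method
4. Fisher’s ideal method
5. Marshall-Edgeworth method
6. Kelly’s method
7. Walsch’s method
Items | $q_0$ | $q_1$ | $p_0$ | $p_1$
A | 20 | 30 | 10 | 15
B | 30 | 20 | 12 | 10
C | 40 | 60 | 14 | 20
Total | 90 | 110 | 36 | 45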
Solution:
For calculating Kelly’s and Walsch’s Price Index we need the average quantity q and the
geometric mean Q of $q_0$ and $q_1$:
$$q = (q_0 + q_1)/2$$
$$Q = \sqrt{q_0 q_1}$$
(6) Kelly’s Method
$$P_{01} = \frac{\sum p_1 q}{\sum p_0 q} \times 100 = \frac{1{,}625}{1{,}250} \times 100 = 130$$
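All seven weighted aggregative indexes differ only in the quantity used as the weight, so they can be computed from one helper. The sketch below is an illustration under stated assumptions: the helper name `aggregate_index` is hypothetical, and row A (20, 30, 10, 15) is inferred from the totals row of the table (90, 110, 36, 45) rather than read directly from it.

```python
import math

# (item, q0, q1, p0, p1); row A is inferred from the totals row of the table
rows = [("A", 20, 30, 10, 15), ("B", 30, 20, 12, 10), ("C", 40, 60, 14, 20)]

def aggregate_index(rows, weight):
    """Weighted aggregative price index: sum(p1*w) / sum(p0*w) * 100."""
    num = sum(p1 * weight(q0, q1) for _, q0, q1, _, p1 in rows)
    den = sum(p0 * weight(q0, q1) for _, q0, q1, p0, _ in rows)
    return num / den * 100

laspeyres = aggregate_index(rows, lambda q0, q1: q0)
paasche = aggregate_index(rows, lambda q0, q1: q1)
indices = {
    "Laspeyres": laspeyres,
    "Paasche": paasche,
    "Dorbish-Bowley": (laspeyres + paasche) / 2,
    "Fisher": math.sqrt(laspeyres * paasche),
    "Marshall-Edgeworth": aggregate_index(rows, lambda q0, q1: q0 + q1),
    "Kelly": aggregate_index(rows, lambda q0, q1: (q0 + q1) / 2),
    "Walsch": aggregate_index(rows, lambda q0, q1: math.sqrt(q0 * q1)),
}
for name, value in indices.items():
    print(f"{name}: {value:.2f}")
```

With these data the Kelly value reproduces the 130 obtained above, and the Marshall-Edgeworth value coincides with it because halving every weight leaves the ratio unchanged.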
Example 14.12: Calculate the suitable price index for the following data.
Commodity | Quantity | Price in 2015 | Price in 2016
A | 20 | 3 | 7
B | 25 | 5 | 8
C | 35 | 6 | 9
Solution: Kelly’s price index number is the most suitable index number, as the quantities for
both the current year and base year are the same.
Commodity | $q$ | $p_0$ | $p_1$ | $p_0 q$ | $p_1 q$
A | 20 | 3 | 7 | 60 | 140
B | 25 | 5 | 8 | 125 | 200
C | 35 | 6 | 9 | 210 | 315
Total | | | | 395 | 655
14.7 Calculate the price indices from the following data by applying -
Oil 20 10 25 13
Pulses 50 8 60 7
Sugar 35 7 40 6
Wheat 25 5 35 4
14.8 Calculate the Dorbish and Bowley’s price index number for the following data taking
2014 as the base year.
Items 2014 2015
A 50 30 54 35
B 35 2 45 3
C 80 3 100 4
D 25 2 30 3
E 35 2 40 3
14.9 Compute Marshall-Edgeworth price index number for the following data by
taking 2016 as the base year.
14.10 Calculate the suitable price index for the following data:
Commodity | Quantity | Price in 2015 | Price in 2016
A | 30 | 4 | 6
B | 15 | 3 | 7
C | 20 | 6 | 8
The weighted average of price relatives is computed by introducing weights into the
unweighted price relatives. The weights are determined by the value consumed in the base
period for weighting the commodities. As shown previously, we may use either arithmetic
mean or the geometric mean to average weighted price relatives. The weights are used to
reflect the consumption levels of the average consumer.
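The weighted average of price relatives using the arithmetic mean:
If the price relative is $P = (p_1 \div p_0) \times 100$ and the weight is $w = p_0 q_0$, then the weighted price relative index
is:
$$P_{01} = \frac{\sum \left(\frac{p_1}{p_0} \times 100\right) p_0 q_0}{\sum p_0 q_0} = \frac{\sum wP}{\sum w}$$
The weighted average of price relatives using the geometric mean:
$$P_{01} = \text{antilog}\left[\frac{\sum w \log P}{\sum w}\right]$$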
Example 14.13
Compute the price index for the following data by applying the weighted average of price
relative method using (i) Arithmetic mean and (ii) Geometric mean.
Computation for the weighted average of price relatives using arithmetic mean.
Totals: $\sum w = 1{,}500$; $\sum wP = 1{,}69{,}000$; $\sum w \log P = 3{,}077.42$
$$P_{01} = \frac{\sum wP}{\sum w} = 1{,}69{,}000 \div 1{,}500 = 112.67$$
This means that there has been a 12.67% increase in prices over the base year.
The index number using the geometric mean of price relatives is:
$$P_{01} = \text{antilog}\left[\frac{\sum w \log P}{\sum w}\right] = \text{antilog}(3{,}077.42 \div 1{,}500) = \text{antilog}(2.0516) = 112.62$$
This means that there has been a 12.62% increase in prices over the base year.
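The two averages can be reproduced from the column totals quoted in Example 14.13. The sketch below also includes a general helper; the name `weighted_relatives` and the tuple layout `(p0, p1, q0)` are assumptions, and the five-item data set (A to E) from Example 14.8 is reused purely for illustration.

```python
import math

def weighted_relatives(rows):
    """rows: (p0, p1, q0) per item. Returns (arithmetic, geometric) indexes."""
    total_w = total_wP = total_w_logP = 0.0
    for p0, p1, q0 in rows:
        P = p1 / p0 * 100          # price relative
        w = p0 * q0                # base-year value weight
        total_w += w
        total_wP += w * P
        total_w_logP += w * math.log10(P)
    # antilog (base 10) of the weighted mean of log P gives the geometric form
    return total_wP / total_w, 10 ** (total_w_logP / total_w)

# Column totals quoted in Example 14.13
am = 169000 / 1500
gm = 10 ** (3077.42 / 1500)
print(round(am, 2), round(gm, 2))  # 112.67 112.62

# Five-item data (A to E) reused for illustration: (p0, p1, q0)
rows = [(10, 15, 20), (12, 10, 30), (14, 20, 40), (20, 25, 50), (25, 22, 60)]
am2, gm2 = weighted_relatives(rows)
print(round(am2, 2))  # 109.67
```

With $w = p_0 q_0$ the arithmetic form reduces to Laspeyres' index, which is why the second data set returns the same 109.67 obtained earlier for those items.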
3. QUANTITY INDEX NUMBER
Just like the Price Index number the quantity index number measures the changes in the
level of quantities of items consumed, or produced, or distributed during a year under
investigation compared to another year known as the base year.
Similar to price index these formulae measure the quantity index in which quantities of the
different commodities are weighted by their prices.
Example 14.14
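Compute the following quantity indices from the data given below:
Commodity | Base-year price | Base-year value | Current-year price | Current-year value
A | 10 | 100 | 15 | 225
B | 15 | 105 | 18 | 378
C | 8 | 152 | 11 | 242
Solution:
Since we are given the value and the prices, the quantity figures can be obtained by dividing
the value by the price for each of the commodities.
Paasche’s quantity index = 174.23
Fisher’s quantity index = 176.87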
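The quantity indexes for Example 14.14 can be checked with the sketch below. The column order (base-year price, base-year value, current-year price, current-year value) is an assumption; it is the reading under which the Paasche and Fisher forms reproduce the two results 174.23 and 176.87 shown above.

```python
import math

# (item, p0, value0, p1, value1); column order is assumed
rows = [("A", 10, 100, 15, 225), ("B", 15, 105, 18, 378), ("C", 8, 152, 11, 242)]

sum_q1p0 = sum_q0p0 = sum_q1p1 = sum_q0p1 = 0.0
for _, p0, v0, p1, v1 in rows:
    q0 = v0 / p0                  # base-year quantity = value / price
    q1 = v1 / p1                  # current-year quantity = value / price
    sum_q1p0 += q1 * p0
    sum_q0p0 += q0 * p0
    sum_q1p1 += q1 * p1
    sum_q0p1 += q0 * p1

laspeyres_q = sum_q1p0 / sum_q0p0 * 100   # base-year prices as weights
paasche_q = sum_q1p1 / sum_q0p1 * 100     # current-year prices as weights
fisher_q = math.sqrt(laspeyres_q * paasche_q)

print(round(laspeyres_q, 2))  # 179.55
print(round(paasche_q, 2))    # 174.23
print(round(fisher_q, 2))     # 176.87
```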
Many researchers have suggested different formulas to verify the consistency or adequacy
of an index number. Some of the most used tests are given below:
• Time reversal test
• Factor reversal test
• Circular test
• Unit test
It is not possible by any particular formula of an index number to satisfy all the tests
mentioned above. An ideal formula is the one that satisfies the maximum possible relevant
tests under study.
According to the order reversal test even if the arrangement of the items is reversed, the
value of the index number should not change. All the twelve methods of index number
satisfy the order reversal test.
Time Reversal Test
The time-reversal test was proposed by Irving Fisher. According to Fisher, “Time-reversal test
is the test which gives the same ratio between one point of comparison with other for the
calculation of index number irrespective of the fact which of the two is taken as a base”. This
test maintains time consistency by working in both directions, i.e. forward and
backward in time. In simple words, the test is satisfied if the product of the two index numbers is unity when
the base and current years are interchanged, i.e. $P_{01} \times P_{10} = 1$.
The time-reversal test is satisfied by simple aggregative method, Fisher’s method, Marshall-
Edgeworth’s method and Kelly’s method.
For Fisher’s index with the two periods interchanged:
$$P_{10} = \sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}}$$
Factor Reversal Test
This test was also suggested by Fisher. In Fisher’s words, “Just as each formula should permit the
interchange of two time periods without inconsistent results, similarly it ought to permit
interchanging the prices and quantities without giving inconsistent results i.e., the two
results multiplied together should give the true ratio.”
In other words, a formula of index number should give consistent results even if the price and
quantity factors are interchanged, i.e. Price Index × Quantity Index = Value Index:
$$P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$$
Except for the Fisher Ideal index number, none of the formulas discussed above satisfy this
test. For Fisher’s index the quantity index is:
$$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$$
Circular Test
It is an extension of the time-reversal test. The time-reversal test takes into account only two
years, i.e. the current and base years. The circular test requires that an index number formula
works consistently around a cycle of periods. An index number is said to satisfy the circular test when the
three indices $P_{01}$, $P_{12}$ and $P_{20}$ are such that $P_{01} \times P_{12} \times P_{20} = 1$.
The circular test is not satisfied by weighted aggregative methods whose weights change from
period to period. Laspeyres’, Paasche’s, Fisher’s ideal, Marshall-Edgeworth’s and Dorbish and
Bowley’s index numbers therefore do not satisfy this test.
However, the following three methods do satisfy the test:
• Simple aggregative method
• Weighted aggregative method with fixed weights
• Kelly’s method
Unit Test
If the value of the index number is not affected when the units in which prices are quoted are
altered, e.g. prices per kg are converted to prices per quintal or vice versa, then the formula
satisfies the unit test. All the index number formulas satisfy this test except the simple
aggregative method, in which a change in the unit of price of any item changes the index
number drastically.
Example 14.15
The table below provides the prices of the base year and current year of 5 commodities with
their quantities. Use it to verify whether Fisher’s ideal index satisfies the time-reversal test.
Commodity | Base-year price | Base-year quantity | Current-year price | Current-year quantity
A | 10 | 10 | 15 | 15
B | 15 | 7 | 18 | 21
C | 12 | 8 | 15 | 12
D | 18 | 12 | 22 | 15
E | 8 | 19 | 11 | 22
Solution:
$$P_{01} \times P_{10} = 1$$
Therefore, Fisher’s Index number satisfies the time-reversal test.
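The time-reversal check in Example 14.15 can be sketched numerically. The column order (base-year price, base-year quantity, current-year price, current-year quantity) is an assumption, and the variable names are illustrative. For contrast, the same product is computed for Laspeyres' index, which does not return unity.

```python
import math

# (p0, q0, p1, q1) per commodity; column order is assumed
rows = [(10, 10, 15, 15), (15, 7, 18, 21), (12, 8, 15, 12),
        (18, 12, 22, 15), (8, 19, 11, 22)]

p0q0 = sum(p0 * q0 for p0, q0, p1, q1 in rows)
p1q0 = sum(p1 * q0 for p0, q0, p1, q1 in rows)
p0q1 = sum(p0 * q1 for p0, q0, p1, q1 in rows)
p1q1 = sum(p1 * q1 for p0, q0, p1, q1 in rows)

# Fisher's index forward (base 0) and backward (base 1), as ratios
fisher_01 = math.sqrt((p1q0 / p0q0) * (p1q1 / p0q1))
fisher_10 = math.sqrt((p0q1 / p1q1) * (p0q0 / p1q0))

# Laspeyres forward and backward for comparison
laspeyres_01 = p1q0 / p0q0
laspeyres_10 = p0q1 / p1q1

print(round(fisher_01 * fisher_10, 6))        # 1.0
print(round(laspeyres_01 * laspeyres_10, 4))  # 1.0114
```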
Example 14.16
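Calculate the price index and quantity index for the following data by Fisher’s ideal formula
and verify that it satisfies the factor reversal test.
Commodity | Base-year price | Base-year quantity | Current-year price | Current-year quantity
A | 10 | 10 | 15 | 15
B | 15 | 7 | 18 | 21
C | 12 | 8 | 15 | 12
D | 18 | 12 | 22 | 15
E | 8 | 19 | 11 | 22
Solution:
Index number by Fisher’s ideal index method
$$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$$
$$P_{01} \times Q_{01} = \sqrt{\left(\frac{1355}{669}\right)^2} = \frac{1355}{669}$$
$$P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$$
Hence, Fisher ideal index number satisfies the factor reversal test.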
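The factor reversal check can be sketched in the same way: the product of Fisher's price and quantity indexes should equal the value ratio $\sum p_1 q_1 \div \sum p_0 q_0$. The column order (base-year price, base-year quantity, current-year price, current-year quantity) is an assumption, and the variable names are illustrative.

```python
import math

# (p0, q0, p1, q1) per commodity; column order is assumed
rows = [(10, 10, 15, 15), (15, 7, 18, 21), (12, 8, 15, 12),
        (18, 12, 22, 15), (8, 19, 11, 22)]

p0q0 = sum(p0 * q0 for p0, q0, p1, q1 in rows)
p1q0 = sum(p1 * q0 for p0, q0, p1, q1 in rows)
p0q1 = sum(p0 * q1 for p0, q0, p1, q1 in rows)
p1q1 = sum(p1 * q1 for p0, q0, p1, q1 in rows)

# Fisher's price index: prices change, quantities act as weights
price_index = math.sqrt((p1q0 / p0q0) * (p1q1 / p0q1))
# Fisher's quantity index: quantities change, prices act as weights
quantity_index = math.sqrt((p0q1 / p0q0) * (p1q1 / p1q0))
# Value ratio: total current-year value over total base-year value
value_index = p1q1 / p0q0

print(round(price_index * quantity_index, 4))  # 2.0254
print(round(value_index, 4))                   # 2.0254
```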
14.8 INDEX NUMBERS IN INDIA
1) Consumer Price Index Number (CPI)
2) NIFTY 50
3) S&P BSE Sensex
Consumer Price Index Number (CPI)
The Consumer Price Index, commonly known as CPI, is an economic indicator. It is a
tool that examines the effect of changes in prices for a basket of consumer goods and
services, such as transportation, food and medical care, and is used as a measure of inflation. CPI
is a weighted average expressed as a percentage and is estimated by taking into
account the changes in prices of each item in a predetermined basket of goods. CPI aims to
compare a consistent basket of products from year to year, focusing on the products that are
bought and used by consumers daily. It is one of the most used statistical tools and helps
in identifying periods of inflation or deflation.
Changes in the prices of commodities affect the cost of living of different groups of the
population in different ways, because consumption patterns differ across
groups of society. A general index number fails to take this into account.
Uses of the Consumer Price Index (CPI)
2. Government uses CPI for estimating wage policy, price policy, rent control, taxation
and general economic policies.
3. By creating awareness of price changes in the economy, it can act as a guide for making
informed decisions about the economy and budgetary provisions.
4. CPI is also used for studying the market price of a particular kind of goods and services.
$$\text{Consumer price index number} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$$
Example 14.17
Calculate the consumer price index number for 2010 based on 2000 from the following data
by using (i) the aggregate expenditure method and (ii) the family budget (or weighted relatives)
method.
Commodity | Quantity | Price in 2000 | Price in 2010
Oil | 2 | 80 | 120
Rice | 10 | 25 | 35
Wheat | 20 | 20 | 40
Sugar | 8 | 30 | 32
Solution
(i) Calculation of the consumer price index number by the aggregate expenditure
method.
$$\text{Consumer price index number} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{1{,}646}{1{,}050} \times 100 = 156.76$$
(ii) Calculation of the consumer price index number by the family budget method or
weighted relatives method.
$$\text{Consumer price index number} = \frac{\sum wP}{\sum w} = 1{,}64{,}600 \div 1{,}050 = 156.76$$
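Both methods in Example 14.17 can be verified with a few lines. This is a sketch with illustrative variable names; it takes the table's quantity column as the base-year quantity $q_0$, which is an assumption.

```python
# (commodity, q0, p0, p1): quantity, base-year price, current-year price
rows = [("Oil", 2, 80, 120), ("Rice", 10, 25, 35),
        ("Wheat", 20, 20, 40), ("Sugar", 8, 30, 32)]

# (i) Aggregate expenditure method: sum(p1*q0) / sum(p0*q0) * 100
sum_p1q0 = sum(p1 * q0 for _, q0, p0, p1 in rows)
sum_p0q0 = sum(p0 * q0 for _, q0, p0, p1 in rows)
cpi_aggregate = sum_p1q0 / sum_p0q0 * 100

# (ii) Family budget method: weights w = p0*q0, relatives P = p1/p0 * 100
sum_w = sum(p0 * q0 for _, q0, p0, p1 in rows)
sum_wP = sum(p0 * q0 * (p1 / p0 * 100) for _, q0, p0, p1 in rows)
cpi_family_budget = sum_wP / sum_w

print(sum_p1q0, sum_p0q0)           # 1646 1050
print(round(cpi_aggregate, 2))      # 156.76
print(round(cpi_family_budget, 2))  # 156.76
```

The two methods agree because multiplying each relative $p_1/p_0$ by the weight $p_0 q_0$ cancels $p_0$ and leaves $p_1 q_0$.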
NIFTY50
NIFTY 50 Index is the National Stock Exchange (NSE) of India’s popular stock market index. It
represents the weighted average of 50 of the largest Indian companies listed on NSE. It is one
of the two stock indices used in India, the other being the BSE-SENSEX. NIFTY was launched
in 1996 and is owned and managed by NSE Indices (previously known as India Index Services
and Products Limited), a wholly owned subsidiary of the NSE Strategic Investment
Corporation Limited. NIFTY 50 covers 14 sectors (as of 20 June 2020) of the Indian economy
and offers investment managers exposure to the Indian market in one portfolio.
The NIFTY 50 Index gives a weightage of 39.47% to financial sectors, 15.31% to Energy,
13.01% to IT, 12.38% to consumer goods, 6.11% to automobiles and 0% to the agricultural
sector.
The base value of the index has been set at 1,000, with a base date of November 3, 1995. It is a
free-float market capitalization-weighted index, i.e. a float factor is assigned to each stock
to account for the proportion of outstanding shares that are held by the general public, as
opposed to closely held shares owned by government, royalty or company insiders.
S&P BSE Sensex
The S&P BSE SENSEX (S&P Bombay Stock Exchange Sensitive Index) is one of the globally
renowned stock market indexes. It is a free-float market-weighted stock
market index of 30 well-established and financially sound companies, selected to be
representative of the various industrial sectors of the Indian economy.
It was first compiled in 1986; the base value is 100, with 1978-79 as the base
year.
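14.9 LET US SUM UP
1. Price relative in period n:
$$P_{0n} = \frac{p_n}{p_0} \times 100$$
Simple Arithmetic Mean of Price relatives
$$P_{01} = \frac{\sum \left(\frac{p_1}{p_0} \times 100\right)}{N}$$
Simple Geometric Mean of Price relatives
$$P_{01} = \text{antilog}\left[\frac{\sum \log\left(\frac{p_1}{p_0} \times 100\right)}{N}\right]$$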
Simple Aggregative Price Index
$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$$
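Laspeyres’ Price Index
$$P_{01} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$$
Paasche’s Price Index
$$P_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$$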
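Marshall-Edgeworth Price Index
$$P_{01} = \frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100$$
Weighted average of price relatives (arithmetic mean)
$$P_{01} = \frac{\sum wP}{\sum w}, \quad \text{where } w = p_0 q_0 \text{ and } P = \frac{p_1}{p_0} \times 100$$
Weighted average of price relatives (geometric mean)
$$P_{01} = \text{antilog}\left[\frac{\sum w \log P}{\sum w}\right]$$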
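4. Quantity index
(a) Unweighted quantity index for period n
$$Q_{0n} = \frac{\sum q_n}{\sum q_0} \times 100$$
$$Q_{0n} = \frac{\sum \left(\frac{q_n}{q_0} \times 100\right)}{N}$$
$$Q_{0n} = \frac{\sum q_n w}{\sum q_0 w} \times 100$$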
14.10 KEYWORDS
Index Number: A ratio that measures the change in a variable over time.
Base period: It is the reference period against which comparisons are made.
Consumer Price Index Number: The average change in the prices paid by the consumer on
specified goods and services over some time, sometimes referred to as the Cost of Living
Index.
Fixed Weight Aggregate Method: Quantities consumed in the specific period are used as
weights to calculate the aggregate index.
Laspeyres’ Method: To measure the aggregate index, this method uses quantities consumed
in the base period as weights.
Price Index: It is used to compare the changes in the prices of commodities from one period
to another.
Value Index: It is a measure to study the changes in total monetary worth over time.
Unweighted Aggregative Price Index: All the values are assigned equal importance to
estimate the changes in prices, over time, for an entire group of commodities.
Unweighted average of price relatives: it is the average of price relatives for all items. The
average could be arithmetic mean or geometric mean.
Weighted aggregate index: It is an aggregate of items that have been weighted in some way
either by corresponding quantities produced, consumed, or sold to reflect
their importance.
The weighted average of relatives method: The average is estimated by multiplying each price
relative by its weight, where the weight is the value consumed in the base period.
14.11 The following data shows the monthly rent of a 2BHK house in different locations
over three years. Calculate the simple aggregate price index number for the years
2006 and 2007 using 2005 as the base year.
14.12 The data below describe the average salary for the employees in the company over
the past 10 consecutive years. Calculate an index for these averages using year 4 as the base
year. Calculate percentage points change between consecutive years.
Year Salary
1 9,800
2 10,200
3 11,000
4 12,400
5 13,100
6 14,100
7 14,900
8 15,700
9 16,400
10 17,800
14.13 Below are the prices for different commodities for the years 2008 and 2009.
Calculate the price index based on price relatives using the geometric mean.
A 43 50
B 48 55
c 32 5
D 61 64
E 40 43
F 52 55
14.14 The price for consecutive four years for a men’s clothing brand is given below.
Calculate an unweighted average of price relatives index for each year using 2000 as the base
year.
14.15 Calculate the weighted average of relatives quantity indices using price and quantity
from 1995 to compute value weights, with 1995 as the base year.
Sedan 45 48 56 13.9
Hatchback 64 67 71 8.3
SUV 28 35 27 23.8
MUV 21 16 28 15.7
14.16 A book publishing house is interested to know whether the sales have changed after
the release of the first edition of the book. Using 2011 as a base calculate the unweighted
aggregate quantity index for 2012 and 2013.
English 11 8 15
Maths 27 26 30
Science 10 26 32
Social Studies 24 18 26
Physics 16 20 21
Chemistry 19 15 22
Biology 32 37 35
Economics 48 53 50
14.17 The data given below are the prices and respective quantities sold by a farmer for crops
grown in past years. Construct Laspeyres’, Paasche’s, Dorbish & Bowley’s and Marshall-
Edgeworth index numbers.
2010 2011
Potato 10 90 12 105
Corn 20 50 29 60
14.18 The data given below are the number of individuals who have taken personal health
insurance. Calculate the unweighted average of relatives index for each year.
Doctors 54 65 86 103
Fireman 39 41 55 76
Policeman 48 61 76 93
Teachers 46 58 75 96
14.19 In 1998, the average monthly wage for teachers was Rs 1,42,600. In 2002 the
average monthly wage for the same group was Rs 1,52,800. The consumer price index in
2002 using 1998 as the base period was 148. Calculate the real average monthly wage for this
group in 1998.
14.20 A local jam manufacturing company feels that the sales of its four
best-selling flavours are changing. The data for the years 2000 through 2004 are given below. Calculate the fixed weight
aggregates index for each year using 2000 prices as the base and the 2004 quantities as fixed
weights.
Orange 58 62 69 79 21 25 20 18
Pineapple 84 89 99 99 29 27 23 24