
Business Statistics Book - OCR

Uploaded by

prince vashista
Copyright
© © All Rights Reserved

Institute of Management Technology
IMT | Centre for Distance Learning, Ghaziabad

VISION

Imparting continuum of management education through distance mode to learners across the globe.

MISSION
Be an academic community leveraging technology as a bridge to innovation and life-long learning.
To continuously evolve management competencies for enhanced employability and entrepreneurship.
To serve society through excellence and leadership in management education, research and consultancy.
OPMC001
Business Statistics

INDEX

UNIT 1
Defining and Collecting Data
UNIT 2
Organizing and Visualizing Variables
UNIT 3
Numerical Descriptive Measures
UNIT 4
Basic Probability
UNIT 5
Discrete Probability Distribution
UNIT 6
The Normal Distribution and Other Continuous Distributions
UNIT 7
Sampling Distributions
UNIT 8
Fundamentals of Hypothesis Testing: One-Sample Tests
UNIT 9
Two-Sample Test
UNIT 10
Analysis of Variance
UNIT 11
Chi-Square Test
UNIT 12
Simple Linear Regression
UNIT 13
Simulation
UNIT 14
Index Number
ACKNOWLEDGEMENT

We acknowledge, with gratitude, the assistance taken, in preparing the study
material of the present course, from the texts, websites, and a/v sources cited at
different places within the Units. We, thankfully, also acknowledge the assistance
taken by us from the content generated by individual authors, publishing houses,
educational institutes, research agencies, consultancies, government bodies and
public sources of commercial organizations etc. (cited within the Units).
EXPERT COMMITTEE

Prof. (Dr.) S. R. Musanna, IMT CDL, Ghaziabad
Prof. (Dr.) S. Venkataramaniah, IIM, Lucknow
Prof. (Dr.) Ravindra Kumar, IMT CDL, Ghaziabad
Prof. (Dr.) Kunal Ganguly, IIM, Kashipur
Prof. (Dr.) Subhajit Bhattacharya, IMT, Ghaziabad
Mr. C. K. Mani, Founder and CEO, Logchain, India
Prof. (Dr.) A. H. Kalro (Retd.), IIM, Kozhikode
Prof. (Dr.) Asif Zameer, IMT CDL, Ghaziabad
Prof. (Dr.) P. K. Jain, IIT, Delhi
Prof. (Dr.) B. B. Chakraborti (Retd.), IIM, Kolkata

SLM PREPARATION TEAM

Dr. Narendra Mohan Mishra, IMT CDL, Ghaziabad
Prof. (Dr.) Sumeet Kaur, FORE School of Management, New Delhi
Prof. (Dr.) Surendra Kumar, Jaipuria Institute of Management, Noida
Prof. (Dr.) Wajid Saiyed, Jamia Millia Islamia, New Delhi

COURSE COORDINATOR

Dr. Narendra Mohan Mishra, IMT CDL, Ghaziabad

EXPERT REVIEW

Mr. Upal Chakraborty, Industry expert, Former CIO, Pepsi

PUBLISHER & PRINTER


Published by: Institute of Management Technology, Centre for Distance Learning, Ghaziabad

Publisher Address: A-16, Site 3, UPSIDC Industrial Area, Meerut Road, Ghaziabad
Printed by: Utility Forms Pvt. Ltd., A-23/B-1, Mohan Cooperative Industrial Estate, Mathura Road,
New Delhi- 110044; Phone No: 011-46757575: E-mail: [email protected]

Edition: First (2021)

© Institute of Management Technology, Centre for Distance Learning, Ghaziabad

ISBN: 978-81-951960-6-7
All rights reserved. No part of this work may be reproduced in any form, by mimeography or any other means,
without permission in writing from Institute of Management Technology, Centre for Distance Learning.
Further information on Institute of Management Technology, Centre for Distance Learning may be obtained
from Institute’s Head Office at Ghaziabad or www.imtcdl.ac.in
OPMC001
Business Statistics

DEFINING AND COLLECTING DATA

STRUCTURE
1.0 Objectives
1.1 Introduction
1.2 Defining Variables
1.3 Measurement Scales
1.4 Collecting Data
1.5 The Methods of Data Collection
1.6 Different Ways of Collecting Samples
1.7 Let Us Sum Up
1.8 Keywords
1.9 References and Suggested Additional Readings
1.10 Self-Assessment Questions
1.11 Answers to Self-Assessment Questions
1.12 Check Your Progress — Possible Answers

1.0 OBJECTIVES
After reading this unit, you will be able to:

• define data and information
• define variables
• understand different measurement scales
• understand the process of collecting data
• understand the different ways of collecting a sample

1.1 INTRODUCTION
The business world of today is so closely tied to data and statistics that a business or
management professional can hardly think of operations apart from data processes and
systems. Jim Gray, winner of the prestigious Turing Award, imagined the world to be
data-driven. In this scenario, we can ignore data only at our peril, as it has become
increasingly evident that collecting facts and figures and defining variables for the
business are crucial processes. This Unit discusses the meaning of data, associated
variables, scales of statistical measurement, and related processes. It takes you on a
guided tour into the world of data and statistics and explains the essential features of
their utility and application for management professionals.

As an initial exercise to understand the properties of data and statistics, you may attempt to
answer the following question:

Which of the following data are countable, and which are measurable?

• The volume of water released from the dam.
• The colour of a student's eyes.
• The number of employees working in a company.


Caselet-1

You are the Sales Manager in charge of the best-selling washing machine in its category.
For years, your chief competitor has been making incremental sales gains, claiming a
better washing machine. Worse, a new sibling product from your company, known for
its good quality, has rapidly gained significant market share at the expense of your
product. Worried that your product may soon lose its number one status, you seek to
improve sales of the product by improving its after-sales service. You experiment and
develop a new after-sales service process. You conduct surveys and discover that
people overwhelmingly like the new process, and you decide to adopt it since
statistical evidence has shown that people prefer it. What could go wrong?

You may now realize that much could go wrong. The above case tells us that if we choose the
wrong variables, we may not end up with results that support better decisions.
As an initial note, Statistics is a way of thinking that can help fact-based decision-making.
Statistics is the branch of mathematics that transforms numbers into useful information for
decision-makers. Statistics provides a way of understanding and then reducing, but not
eliminating, the variation that is part of any decision-making process. Statistical data also tell
us the known risks associated with decision-making. Statistics achieves this by providing a
set of methods for analyzing numbers. These methods help us to find patterns in numbers
and enable us to determine whether differences in the numbers are merely due to chance.
As we progress in this unit, we will learn these methods and will also learn the appropriate
conditions for using them.

1.2 DEFINING VARIABLES

Variables are numbers, amounts, or situations that can change, that are not rigid
or static, and have the propensity to transform, modify, or shift from their initial
locus. The idea of a variable originated from the work of the French
mathematician François Viète towards the end of the sixteenth century. He
brought into practice the method of representing known or even unknown
numbers by letters and computing with them as if they were
numerical entities.

Statisticians classify variables as either categorical or numerical and further
classify numerical variables as either discrete or continuous.

1.2.1 CATEGORICAL VARIABLES

Categorical variables, also known as qualitative variables, take values that
can be placed into categories, such as 'yes' or 'no'.

“Do you currently own stocks and bonds?” and “Did you buy any of the shoes advertised
in the flyer with today's newspaper?” are questions that yield categorical variables, both of
which have 'yes' or 'no' as their values. Categorical variables can have more than
two possible responses, for example when customers are asked to indicate the day of the
week on which they made their purchases.

Question. Gender, with its categories male and female, is an example of a __________ variable.

1.2.2 NUMERICAL VARIABLES

These are also known as quantitative variables; their values represent
quantities. For example, the response to the question “How much money do you
expect to spend on a stereo?” is a numerical variable.
Numerical variables are further subdivided into discrete and continuous variables.
Discrete variables have numerical values that arise from a counting process. The
number of magazines subscribed to is an example of a discrete numerical variable
because the response is one of a finite number of integers: you subscribe to
zero, one, two, and so on. The number of items that a customer purchases
is also a discrete numerical variable because we are counting the number of items
purchased.
Continuous variables produce numerical responses that arise from a measuring
process. The time you wait before an ATM dispenses the requested cash, the
time spent waiting for a web page to load, and the temperature of a body are
examples of continuous numerical variables because the responses can take on any
value within a continuum or interval, depending on the precision of the
measuring instrument. For example, your waiting time could be 1 minute, 1.1
minutes, 1.11 minutes, or 1.113 minutes, depending on the precision of the
measuring device you use. The following fill-in-the-blank questions will further
clarify the nature and meaning of numerical variables:
• What is your height? __________ centimetres.
    ○ Is it a discrete or continuous variable?

• How many employees are working in your organization? __________
    ○ Are you counting or measuring?
    ○ Is it a discrete or continuous variable?
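
The classification described in this section can be sketched in a few lines of Python. This is only an illustrative heuristic, not a rule from the text: the function name classify_variable and the sample responses are assumptions made for the example. It treats text answers as categorical, whole-number answers as counts (discrete), and other numeric answers as measurements (continuous).

```python
from numbers import Integral, Real


def classify_variable(values):
    """Heuristically label a list of responses by variable type.

    Assumption: text and yes/no values are categorical, whole numbers come
    from counting, and other numbers come from measuring.
    """
    if all(isinstance(v, (str, bool)) for v in values):
        return "categorical"
    if all(isinstance(v, Integral) and not isinstance(v, bool) for v in values):
        return "numerical (discrete)"
    if all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        return "numerical (continuous)"
    return "mixed or unknown"


# Hypothetical responses modelled on the examples in this section
owns_stocks = ["yes", "no", "yes"]            # categorical
magazines_subscribed = [0, 1, 2, 1]           # counted
waiting_minutes = [1.0, 1.1, 1.11, 1.113]     # measured

for name, data in [
    ("owns_stocks", owns_stocks),
    ("magazines_subscribed", magazines_subscribed),
    ("waiting_minutes", waiting_minutes),
]:
    print(name, "->", classify_variable(data))
```

The heuristic has limits: a numeric code such as a PIN or a jersey number is stored as a whole number but is really a category, so the final judgment still depends on how the value was obtained (labelling, counting, or measuring).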

1.3 MEASUREMENT SCALES


A measurement scale defines how the values of a variable can be ordered. It also
determines whether differences between pairs of values are meaningful and
equivalent, and whether we can express one value in terms of another. Because
measurement scales serve these several purposes, more than one type of scale
exists.

In Table 1.1 we present examples of measurement scales, some of which are used
in the remaining parts of this section. We define numerical variables as using
either an interval scale, which expresses differences between measurements
but does not include a true zero point, or a ratio scale, an ordered scale that
includes a true zero point.

If a numerical variable has a ratio scale, we can characterize one value in terms of
another. For example, an item that costs Rs 2 is twice as expensive as an item
that costs Re 1.

It must, however, be clearly understood that because Centigrade temperatures
use an interval scale, 2°C does not represent twice the heat of 1°C. Hence, we
must be careful before defining ratios.

For both interval and ratio scales, what the difference of 1 unit represents
remains the same among pairs of values, so that the difference between Rs 11
and Rs 10 represents the same difference as the difference between Rs 2 and Re 1
(and the difference between 11°C and 10°C represents the same as the difference
between 2°C and 1°C).
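
A short numerical sketch may make the interval/ratio distinction concrete. The helper names below are assumptions made for illustration; the conversions themselves (Fahrenheit = Celsius × 9/5 + 32, Kelvin = Celsius + 273.15) are standard. The idea is that a ratio computed on an interval scale changes when the arbitrary zero point changes, while equal differences stay equal.

```python
def celsius_to_fahrenheit(c):
    """Same temperature, different arbitrary zero and unit size."""
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    """Kelvin has a true zero, so ratios of Kelvin values are meaningful."""
    return c + 273.15


# Ratio scale: money has a true zero, so Rs 2 really is twice Re 1.
price_ratio = 2 / 1

# Interval scale: the "ratio" of 2°C to 1°C depends on where zero sits.
naive_ratio_celsius = 2 / 1
ratio_fahrenheit = celsius_to_fahrenheit(2) / celsius_to_fahrenheit(1)
ratio_kelvin = celsius_to_kelvin(2) / celsius_to_kelvin(1)

# Differences, however, behave consistently on both kinds of scale.
money_differences_equal = (11 - 10) == (2 - 1)
celsius_differences_equal = (11 - 10) == (2 - 1)

print("price ratio:", price_ratio)
print("2°C / 1°C computed naively:", naive_ratio_celsius)
print("same temperatures in Fahrenheit:", round(ratio_fahrenheit, 4))
print("same temperatures in Kelvin:", round(ratio_kelvin, 4))
print("equal 1-unit differences:", money_differences_equal, celsius_differences_equal)
```

The three temperature "ratios" disagree with one another, which is why a ratio should only be quoted for a variable whose scale has a true zero.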
Categorical variables use measurement scales that provide less insight into the
values for the variable. For data measured on a nominal scale, category values
express no order or ranking. For data measured on an ordinal scale, an ordering or
ranking of category values is implied. Ordinal scales give us information to
compare values, but not as much as interval or ratio scales. For example, the
ordinal scale poor, fair, good, and excellent tells us that
“good” is better than “poor” or “fair” but not better than “excellent”. Unlike
interval and ratio scales, however, an ordinal scale does not tell us whether the
difference from “poor” to “fair” is of the same magnitude as that from “fair” to
“good” or from “good” to “excellent”.
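
As a small sketch of what an ordinal scale does and does not support, the ratings above can be stored as an ordered list. The names RATING_ORDER, rank, and is_better are hypothetical, chosen only for this example. Comparing and sorting ratings is meaningful; subtracting them is not, because the positions are ranks rather than measured distances.

```python
# Ordered from worst to best, as in the example in the text
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def rank(rating):
    """Position of a rating in the ordering (a rank, not a measured quantity)."""
    return RATING_ORDER.index(rating)


def is_better(a, b):
    """True if rating a is ranked above rating b."""
    return rank(a) > rank(b)


# Hypothetical survey responses
responses = ["good", "poor", "excellent", "fair", "good"]
sorted_responses = sorted(responses, key=rank)

print(is_better("good", "fair"))        # comparison is meaningful
print(is_better("good", "excellent"))
print(sorted_responses)                 # so is ordering
# rank("fair") - rank("poor") equals rank("good") - rank("fair"), but only
# because ranks are consecutive integers; it says nothing about real distance.
```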

Table 1.1: Examples of Different Scales and Types

Data                                                Scale type   Value
Automobile manufacturers                            Nominal      Maruti, Honda
The salary difference between CEO and Ramesh        Interval     Rs 40,000
Grades in the examination as per marks;             Ordinal      Grade A, 5th position
position in the class
The ratio of postgraduate and graduate students     Ratio        5:4
in the organization

1.4 COLLECTING DATA


The collection of data is extremely important. It must be based on objective grounds, with the
scales properly defined and the instruments checked and rechecked before use. Any error
has the potential to vitiate the entire process.

1.4.1 POPULATION AND SAMPLES

The numbers or the data collected are meaningless unless all variables have operational
definitions. An operational definition has a universally accepted meaning that is clear to
everyone associated with the process of analysis. Even though the operational definition for
sales per year might seem clear, miscommunication could occur if one person was referring
to sales per year for the entire chain of stores and another to sales per year per store, or if
one person measures a quarter in 4-4-5 weeks and another measures to the end of the month.
It is imperative to familiarize oneself with the basic vocabulary of statistics as discussed below:
Variables:

Variables are characteristics of items or individuals. When we use a statistical method, we
usually analyze a variable. Sales figures, expenses by year, or net profit by year are examples
of variables that decision-makers may wish to analyze. When used in everyday speech,
variable suggests something that varies, and we would expect the sales, expenses,
and net profit to have different values from year to year. These different values are the data
associated with a variable, or more simply, the data to be analyzed. Variables can differ for
reasons other than time. For example, if we analyze the composition of a large
postgraduate lecture class, we may wish to include variables such as the gender of the
students, the courses undertaken at the undergraduate level, and the city each student hails
from. These variables would vary because each student in the class is different. One could be
a male B.Com graduate from Pune, while another may be a female Economics graduate
from Delhi.

Population (N):

The symbol “N” is used to represent the population. Four other basic vocabulary terms are
population, sample, parameter, and statistic. A population consists of all the items or
individuals about which we wish to acquire information. All the sales transactions of an
electronics goods store, all the customers who shopped at a city mall during a
specific weekend, all the students who enrolled for a part-time course at a particular
college in a particular year, and all the registered voters of a town called Ramgarh are
examples of populations.

Sample (n):
The symbol “n” denotes a sample. A sample is the portion of a population selected for analysis.
Therefore, if we say that 200 sales transactions of the electronics goods store are randomly
selected by an auditor for study, or 30 shoppers at the mall are asked to complete a customer
satisfaction survey, or 50 part-time students are selected for a marketing study, or 500
registered voters in Ramgarh are asked whom they voted for, we are referring to a sample. In
each sample, the transactions or people in the sample represent a portion of the items or
individuals that make up the population.

Parameter:

A parameter is a numerical measure that describes a characteristic of a population. The mean
amount spent by all customers who shopped at the city mall during the weekend is an
example of a parameter because it is based on the amounts spent by the entire population.
A statistic is a numerical measure that describes a characteristic of a sample. The mean
amount spent by the 30 customers completing the customer satisfaction survey is an
example of a statistic because it is based on the amounts spent by only a sample of 30 people.
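
The difference between a parameter and a statistic can be sketched with simulated data. Everything numeric below is an assumption made for illustration: the population of 5,000 shoppers, the range of amounts spent, and the random seed are invented, and only the sample size of 30 comes from the mall example above.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: amount spent (Rs) by every weekend shopper
population = [rng.randint(100, 5000) for _ in range(5000)]

# Parameter: computed from the whole population
population_mean = mean(population)

# Statistic: computed from a sample of n = 30 shoppers
sample = rng.sample(population, 30)
sample_mean = mean(sample)

print("N =", len(population), "parameter (population mean):", round(population_mean, 2))
print("n =", len(sample), "statistic (sample mean):", round(sample_mean, 2))
```

Rerunning with a different seed changes the statistic from sample to sample, while the parameter stays fixed for a given population.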

1.4.2 DATA COLLECTION

Examples of data sources are:

• Capturing revenue and profit figures from the balance sheet of a business
organization.
• Compiling the responses from a survey.
• Conducting an interview or an observational study.
• Consulting newspapers, magazines, and databases, e.g. for the GDP, per-capita income,
or literacy rate of a country.
Data collection involves collecting the values for variables. Data collection could be required
under several circumstances. For example, a marketing research analyst who needs to
assess the effectiveness of a new television advertisement, a pharmaceutical manufacturer
who wishes to determine whether a new drug is more effective than those currently in use,
an Operations Manager who wants to improve a manufacturing or service process, or an
auditor who wants to review the financial transactions of a company to determine whether
the company complies with generally accepted accounting principles may all need to collect
data for further analysis. In each of these examples, collecting data from every item or
individual in the population would be difficult and time-consuming. Since these are typical
cases, data collection usually involves collecting data from a sample.

Primary and Secondary Data:

Data sources are classified as either primary or secondary. When the data collector is the
one using the data for analysis, the source is considered primary. When the person
performing the statistical analysis is not the data collector, the source is secondary.
Organizations and individuals that collate and publish data collect data as a primary source
and then let others use it as a secondary source. For example, the government collects and
distributes data in this manner for both public and private purposes. The Indian Ministry of
Labour and Employment collects data on employment and distributes the monthly All-India
consumer price index. The Census of India under the Ministry of Home Affairs oversees a
variety of surveys regarding population, housing, and so on.

1.5 THE METHODS OF DATA COLLECTION


There are several ways to collect data from a population. Data distributed by an organization
or an individual can be collected by another organization or individual for their own purposes.
Data can also be collected through a designed experiment, a survey, or an observational
study.
Data Distributed by an Organization or Individual:

Market research firms and trade associations distribute data on specific industries or
markets. Investment services provide financial data on a company-by-company basis.
Syndicated services such as AC Nielsen provide clients with data that enables the
comparison of the market share of client products with those of their competitors. Daily
newspapers are filled with numerical information regarding stock prices, weather
conditions, and sports statistics.
Designed Experiment:
Outcomes of a designed experiment are another data source. These outcomes are the result
of an experiment, such as a test of several laundry detergents to compare how well each
detergent removes a certain type of stain. Developing proper experimental designs is a
subject beyond the scope of this study material because such designs often involve
sophisticated statistical procedures.

Questionnaire (Survey form):
Conducting a survey or census using a Questionnaire is the third type of data source. People
being surveyed are asked questions about their beliefs, attitudes, behaviours, and other
characteristics. For example, people could be asked their opinion about which laundry
detergent best removes a certain type of stain. This could lead to a result different from a
designed experiment seeking the same answer.
Observational Study:
Conducting an observational study is the fourth important data source. A researcher collects
data by directly observing a behaviour, usually in a natural or neutral setting. Observational
studies are a common tool for data collection in business. For example, market researchers
use focus groups to elicit unstructured responses to open-ended questions posed by a
moderator to a target audience. Observational study techniques are also used to enhance
teamwork or improve the quality of products and services. Identifying the most appropriate
source is a critical task because if biases, ambiguities, or other types of errors flaw the data
being collected, even the most sophisticated statistical methods will not produce useful
information.
Advantages and Disadvantages of Primary Data over the Secondary Data:
Based on your understanding, write the advantages and disadvantages of primary and
secondary data in Table 1.2 below:

Table 1.2: Advantages and Disadvantages of Primary Data over the Secondary Data

                 Primary Data          Secondary Data
Advantages       ____________          ____________
Disadvantages    ____________          ____________

• Census and Sampling Methods:

Primary data becomes necessary whenever secondary data is not
available. Primary data can be obtained either by the census method or by
the sampling method.

• Census Method

When the researcher collects data from every member of the population (N), it is
referred to as the census method or complete enumeration method. For
example, a census covering every individual is conducted every ten years in India.

Advantages
• Information regarding each member of the population can be obtained. The
information collected is more accurate.

Disadvantages
• It requires a lot of time and a huge amount of money and can easily result in
duplication, exclusion, etc. because of the enormous effort involved.

• Sampling Method

When the researcher collects data from only a few individuals in the population,
those individuals are known as the respondents and constitute the sample (n). This
is referred to as the sampling method. The process of collecting primary data from
the sample is known as a survey. E.g. AC Nielsen conducts surveys on the market
shares of various competitors in an industry.

Advantage:
• Quicker and less costly

Disadvantage:
• The chance of error in the findings is higher, because only part of the population
is observed.


Survey Design
Survey design includes designing the questionnaire, pretesting it, and editing the
primary data.
A set of questions in the data collection form is called the Questionnaire.

Survey Design Exercise


Google Forms provides an easy way to create a survey that contains as many
questions as you need to ask, in a variety of styles. From planning an event to
obtaining anonymous answers to tough questions, there are lots of useful tools in
Google Forms.

From multiple-choice questions to a linear scale, Google Forms provides a handful
of options for asking questions. We can also style the survey to fit its theme,
make certain questions mandatory, and so on.

While Google Forms surveys are typically sent and answered via email, we can
also get respondents to fill in answers on a web page, embed the questionnaire
on a site, and share it via social media. Here are step-by-step instructions for
creating a survey with Google Forms.

Navigate to https://docs.google.com/forms/ and click Blank. Google Forms has several
pre-made templates to choose from, and we can view them all by clicking More.

Fig. 1.1: Pre-made Templates


Name your survey. You can also add a description. If you wish to name the Google
Form for your reference, click Untitled form in the top left corner and edit.

Click Untitled question and compose a question.

Click Multiple choice if the question is planned that way.

Select an option for how the question will be answered. For all options except
short answer, paragraph, date, and time, you will have to provide alternative
options as answers.

• Short answer and Paragraph provide recipients a blank field to fill in.
• Multiple choice lets users select one answer from a series of options, while
Checkboxes allows users to select multiple answers.
• Dropdown provides recipients a field to click that reveals a menu from which they
select an answer.
• Linear scale allows users to answer by selecting a rating from a range such as 1
to 5.
• Date and Time allow recipients to select a date or time.

Fig. 1.2: Dietary Restrictions (sample form with the question “Do you have any allergies?”)


Click the side menu icons to add to the survey.

• The Plus button adds another question.
• The Tt button lets you add a section title and description.
• The Photo and Video buttons allow you to illustrate your survey.
• The two-rectangles icon allows you to break your survey up into sections.



Fig. 1.3: Menu Icons

Click the Required switch to make a question mandatory. Click the duplicate or
trash icons to clone or erase the question.

Repeat the previous steps until all questions are completed.

Click the Palette icon to change your survey's colour or add a photo to the header.

Click the Eye icon to preview your survey.

Click the Gear icon to access survey settings.

Click Send.

Enter recipients. Check off "Include form in the email" if you wish your
respondents to answer questions from their email client. Not all clients support
this. Outlook, for example, will make you click a button to open the survey in a
browser.

Fill in a subject line and message. People typically need a little coaxing to answer a
survey.

Click Send. If you want to share the survey via a hyperlink, it can be obtained by
clicking the link icon. To get code for embedding the survey on a website, click the
<> icon. You can also share the survey via social media with the Google+, Facebook,
and Twitter buttons.

Now that your survey is sent, your audience is expected to answer. To view what
your recipients said, click on Responses.

1.6 DIFFERENT WAYS OF COLLECTING SAMPLES


The process of selecting a sample is called sampling. Sampling is a familiar part of
daily life. A customer in a bookstore picking up a book is a typical example of sampling. A
subset, or some part, of a larger population is called a sample. Several alternative ways to
select samples are available. The main alternative sampling plans may be grouped into two
categories.

The two major categories of sampling are probability sampling and non-probability sampling.

1.6.1 PROBABILITY SAMPLING

Every element of the population has a known, non-zero probability of being included in the
sample. In simple random sampling, this probability is the same for every element. The types
of probability sampling are:

• Simple Random Sampling: A sampling procedure that ensures that each
element in the population will have an equal chance of being included in the
sample.

• Systematic Sampling: A simple process in which every nth name from the list is
drawn.

• Stratified Sampling: A probability sample in which sub-samples are drawn within
different strata, each stratum being based on a characteristic. It is not to be
confused with the quota sample. For example, we may wish to draw samples
separately from individuals with allegiances to different religions. However, we still
follow the random sampling method and then adjust. For example, if we desire an equal
number of men and women but end up with 600 men and 300 women, we select
300 men from the 600 selected men once again through random sampling.

• Cluster Sampling: The purpose of cluster sampling is to sample economically
while retaining the characteristics of a probability sample. The primary sampling
unit is no longer the individual element in the population. The primary sampling
unit is a larger cluster of elements located in proximity to one another.
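
The four probability sampling methods listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a prescribed procedure: the function names, the frame of 900 people (600 men and 300 women, echoing the stratified example), the block-based clusters, and the seed are all hypothetical.

```python
import random


def simple_random_sample(frame, n, rng):
    """Each item has an equal chance of selection (without replacement)."""
    return rng.sample(frame, n)


def systematic_sample(frame, step, rng):
    """Random start within the first `step` items, then every step-th item."""
    start = rng.randrange(step)
    return frame[start::step]


def stratified_sample(frame, stratum_of, n_per_stratum, rng):
    """Simple random sample of n_per_stratum items inside each stratum."""
    strata = {}
    for item in frame:
        strata.setdefault(stratum_of(item), []).append(item)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every element inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


rng = random.Random(7)  # arbitrary seed, for repeatability only

# Hypothetical sampling frame: 600 men and 300 women
frame = [("M", i) for i in range(600)] + [("W", i) for i in range(300)]

srs = simple_random_sample(frame, 40, rng)
systematic = systematic_sample(frame, 30, rng)            # every 30th person
stratified = stratified_sample(frame, lambda p: p[0], 300, rng)

# Hypothetical clusters: ten neighbourhood blocks of 90 people each
clusters = {f"block-{b}": frame[b * 90:(b + 1) * 90] for b in range(10)}
clustered = cluster_sample(clusters, 2, rng)

print(len(srs), len(systematic), len(stratified["M"]), len(stratified["W"]), len(clustered))
```

In every function the selection is driven by the random number generator, which is what gives each element a known chance of inclusion.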

1.6.2 NON-PROBABILITY SAMPLING


The probability of selecting any member is unknown. Not everyone has the same
chance of being selected. The types of non-probability sampling are:

• Convenience Sampling: Also called haphazard or accidental sampling. The
sampling procedure of obtaining the people or units that are most conveniently
available. In other words, just select, for example, the first person you meet on
the road.
• Judgment Sampling: Also called purposive sampling. An experienced individual
selects the sample based on his or her judgment about some appropriate
characteristics required of the sample member.
• Quota Sampling: Ensures that the various sub-groups in a population are
represented on pertinent sample characteristics to the exact extent that the
investigators desire; selection is then done within each of these quotas. It should
not be confused with stratified sampling. In stratified sampling, we use probabilistic
sampling, but in quota sampling, we use non-probabilistic sampling. For example,
we select 300 men and 300 women (fixed) from a population.
• Snowball Sampling: A mixture of procedures in which initial respondents are selected by
probability methods, but additional respondents are obtained from information
provided by the initial respondents.
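
For contrast with the probability methods, here is a hedged sketch of convenience and quota sampling. The function names and the simulated arrival order are assumptions made for illustration; the quotas of 300 men and 300 women follow the example above. Note that no random number generator appears: selection depends only on who happens to come along first, which is why the chance of selection is unknown.

```python
def convenience_sample(arrivals, n):
    """Take the first n people encountered."""
    return arrivals[:n]


def quota_sample(arrivals, group_of, quotas):
    """Accept people in arrival order until each group's fixed quota is full."""
    counts = {group: 0 for group in quotas}
    chosen = []
    for person in arrivals:
        group = group_of(person)
        if group in counts and counts[group] < quotas[group]:
            chosen.append(person)
            counts[group] += 1
        if counts == quotas:
            break  # every quota is filled
    return chosen


# Hypothetical arrival order: two men for every woman passing by
arrivals = [("W" if i % 3 == 0 else "M", i) for i in range(1500)]

first_five = convenience_sample(arrivals, 5)
quota = quota_sample(arrivals, lambda p: p[0], {"M": 300, "W": 300})

men = sum(1 for p in quota if p[0] == "M")
women = sum(1 for p in quota if p[0] == "W")
print(len(quota), men, women)
```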

Exercise on MS Excel:
Simple Random Sample Key Technique:

Use the RANDBETWEEN(smallest integer, largest integer) function to generate a
random integer that can then be used to select an item from a frame.

Example 1: Create a simple random sample with replacement of size 40 from a
population of 800 items.

Workbook: Enter a formula that uses this function and then copy the formula down a
column for as many rows as is necessary. For example, to create a simple random
sample with replacement of size 40 from a population of 800 items, open a new
worksheet. Enter Sample in cell A1 and enter the formula =RANDBETWEEN(1, 800) in
cell A2. Then copy the formula down the column to cell A41.
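
For readers working outside Excel, the same exercise can be sketched in Python. This is an assumed equivalent rather than part of the original exercise: random.randint(1, 800), like RANDBETWEEN(1, 800), returns an integer with both endpoints included, and calling it 40 times gives a sample with replacement. The seed and variable names are arbitrary.

```python
import random

rng = random.Random(2021)  # arbitrary seed so the sketch is repeatable

POPULATION_SIZE = 800   # items in the frame are numbered 1..800
SAMPLE_SIZE = 40

# With replacement: the same item number may be drawn more than once,
# just as copying =RANDBETWEEN(1, 800) down 40 cells may repeat a value.
sample_with_replacement = [
    rng.randint(1, POPULATION_SIZE) for _ in range(SAMPLE_SIZE)
]

# Without replacement (not part of the Excel exercise, shown for contrast):
sample_without_replacement = rng.sample(
    range(1, POPULATION_SIZE + 1), SAMPLE_SIZE
)

print(sample_with_replacement)
print(sample_without_replacement)
```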


Exercise:

Now, I can (Yes/No):

Identify the type of variables

Collect the primary and secondary data

Find the scale of measurement for the data set

Create a sample for data collection

Complete the Concept Map:

Type of variables and examples

1.7 LET US SUM UP

In this unit, we learned in detail how to define and collect data, and the types of variables.
Statistics is the science of collecting, analyzing, presenting, and interpreting data. Data
consist of facts and figures. We learned the methods of data collection and, specifically,
the vocabulary of population, sample, and sampling methods.

1.8 KEYWORDS

Data: The facts and figures collected, analyzed and summarized for presentation and
interpretation.

Variable: A characteristic of interest for the elements.

Nominal scale: The scale of measurement for a variable when the data are labels or names
used to identify an attribute of an element. Nominal data may be numeric or non-numeric.

Ordinal scale: The scale of measurement for a variable if the data exhibit the properties of
nominal data and the order or rank of the data is meaningful.

1.9 REFERENCES AND SUGGESTED ADDITIONAL READINGS


Lohr, S. L., Sampling: Design and Analysis, 2nd ed., Boston.
https://drive.uqu.edu.sa/_/maatia/files/Sampling.pdf

Levine, David M., Stephan, David F., Statistics for Managers Using Microsoft Excel, 8th edition,
Pearson Education.

1.10 SELF-ASSESSMENT QUESTIONS


(A Concept Question):

The oranges grown in corporate farms in an agricultural state were damaged by some
unknown fungi a few years ago. Suppose the manager of a large farm wanted to study the
impact of the fungi on the orange crops daily over 6 weeks. On each day, a random sample of
orange trees was selected from within a random sample of acres. The daily average number
of damaged oranges per tree and the proportion of trees having damaged oranges were
calculated. The two main measures calculated each day (i.e., the average number of
damaged oranges per tree and the proportion of trees having damaged oranges) are called
________ and ________.

CHECK YOUR PROGRESS


SCENARIO 1

A Times of India poll asked 2,150 adults in India a series of questions to find out their views on
the Indian economy.

Q.1 Referring to Scenario 1, the population of interest is:

a) all the males living in India when the polls were undertaken.
b) all the females living in India when the polls were undertaken.
c) all the adults living in India when the polls were undertaken.
d) all the people living in India when the polls were undertaken.

Q.2 Referring to Scenario 1, the 2,150 adults make up:


a) the population
b) the sample
c) the primary data source
d) the secondary data source

Q.3 Referring to Scenario 1, the possible responses to the question "How satisfied are you
with the Indian economy today with 1 = very satisfied, 2 = moderately satisfied, 3 =
neutral, 4 = moderately dissatisfied and 5 = very dissatisfied?" are values from a

a) discrete variable
b) continuous variable
c) ordinal variable
d) table of random numbers


Q.4 Referring to Scenario 1, the possible responses to the question "How many people in
your household are unemployed currently?" are values from a
a) discrete numerical variable.

b) continuous numerical variable.


c) categorical variable.

d) table of random numbers.


Q.5 Referring to Scenario 1, the possible responses to the question "How would you rate
the condition of Indian economy with 1 = excellent, 2 = good, 3 = decent, 4 = poor, 5 =
terrible?" result in

a) a nominal scale variable
b) an ordinal scale variable
c) an interval scale variable
d) a ratio scale variable

Q.6 Referring to Scenario 1, the possible responses to the question "Are you 1. Currently
employed, 2. Unemployed but actively looking for a job, 3. Unemployed and quit
looking for a job?" result in:

a) a nominal scale variable.
b) an ordinal scale variable.
c) an interval scale variable.
d) a ratio scale variable.

Q.7 Referring to Scenario 1, the possible responses to the question "In which year do you
think the last recession in India started?" result in

a) a nominal scale variable.
b) an ordinal scale variable.
c) an interval scale variable.
d) a ratio scale variable.

Q.8 Referring to Scenario 1, the possible responses to the question "On the scale of 1 to
100 with 1 being extremely anxious and 100 being not anxious, rate your level of
anxiety in this Indian economy" result in:

a) a nominal scale variable.
b) an ordinal scale variable.
c) an interval scale variable.
d) a ratio scale variable.

Q.9 The universe or "totality of items or things" under consideration is called


a) a sample
b) a population
c) a primary data source
d) a secondary data source

Q.10 The portion of the universe that has been selected for analysis is called

a) a sample
b) a frame
c) a primary data source
d) a secondary data source

1.11 ANSWERS TO SELF-ASSESSMENT QUESTIONS

ANSWER: statistics
ANSWER: parameters

1.12 CHECK YOUR PROGRESS - POSSIBLE ANSWERS


Q.1 c
Q.2 b
Q.3 c
Q.4 a
Q.5 b
Q.6 a
Q.7 c
Q.8 b
Q.9 b
Q.10 a

ORGANIZING AND VISUALIZING
VARIABLES

STRUCTURE
2.0 Objectives
2.1 Introduction
2.2 Organizing Categorical Variables
2.3 Organizing Numerical Variables
2.4 Visualization of Categorical Data
2.5 Visualizing Numerical Variables
2.6 Visualizing Two Numerical Variables
2.7 Organizing and Visualizing a Mix of Variables
2.8 The Challenge in Organizing and Visualizing Variables
2.9 Let Us Sum Up
2.10 Key Words
2.11 Self-Assessment Questions
2.12 Answers to Self-Assessment Questions
2.13 Check Your Progress—Possible Answers

2.0 OBJECTIVES

After reading this unit, we will be able to:

• summarize qualitative data
• summarize quantitative data
• learn about the business application of pictorial representation
• use Excel for tabular and graphical presentation

2.1 INTRODUCTION

In this unit, we will learn about the systematic organization and visualization of data.
Basically, facts and figures are called data. In statistics, a comprehensive process
covers collecting the data and organizing it in an understandable and readable form.
This unit introduces the commonly used tabular and graphical methods for summarizing both
categorical and numerical variables. Tabular and graphical summaries of data are found in
annual reports, newspaper articles, business magazines, reference books, etc. We are all
exposed to different types of variables and their measurements. In the previous unit, the
term “statistics” was defined as the collection, organization, analysis, interpretation, and
presentation of data. We begin with tabular and graphical methods for summarizing data
concerning a single variable. Section 2.7 introduces the methods of summarizing data for a
mix of variables.

Field work:

Take a round in your colony or your IMTCDL campus. Find out how many types of trees you
can see there. Do you know their names? You can make drawings. Use tally marks to note the
number of different trees.

Name of the tree | Tally marks | Number of trees

2.2 ORGANIZING CATEGORICAL VARIABLES

A variable is a symbol (e.g., X or Y) that represents any of a specified set of values.
For example, suppose variable X represents the percentage of defective units in a shipment
of widgets. Since X is a percentage, it could take on any value between 0 and 100.
Variables can be classified as categorical (aka qualitative) or quantitative (aka numerical).

Categorical variables take on values that are names or labels. The colour of a ball (e.g., red,
green, blue) or the breed of a dog (e.g., collie, shepherd, terrier) would be examples of
categorical variables.

2.2.1 FREQUENCY DISTRIBUTION

A tabular summary of data showing the number (or frequency) of items in each of several
non-overlapping classes is called a frequency distribution.
Let us look at the raw data from unit2_workbook_ex1:

UNIT 2
Organizing and Visualizing
Variables

Table 2.1: Sample of 50 soft drinks purchased by Ram


Coke Classic Coke Classic Coke Classic
Pepsi Pepsi Pepsi
Diet Coke Diet Coke Diet Coke
Dr. Pepper Dr. Pepper Dr. Pepper
Sprite Sprite Sprite
Coke Classic Coke Classic
Diet Coke Pepsi Coke Classic
Diet Coke Diet Coke Coke Classic
Dr. Pepper Dr. Pepper Pepsi
Sprite Sprite Diet Coke
Coke Classic Coke Classic Dr. Pepper
Pepsi Pepsi Sprite
Diet Coke Diet Coke Dr. Pepper
Dr. Pepper Dr. Pepper Dr. Pepper
Sprite Sprite Sprite
Diet Coke Diet Coke Diet Coke
Diet Coke Pepsi

To develop a frequency distribution for these data, we count the number of times each soft
drink appears in the Table 2.1 and summarize it in the frequency distribution table as Table
2.2. The number counted pertaining to each type of drink is called the frequency, and the
table in which a column contains the frequency is called the frequency distribution.

Table 2.2: Frequency Distribution

Sl. No. | Name of the drink | Frequency
1 | Coke Classic | 9
2 | Pepsi | 8
3 | Diet Coke | 13
4 | Sprite | 9
5 | Dr. Pepper | 11
Total | | 50

The frequency distribution provides a summary of how the 50 soft drinks were purchased by
Ram. This distribution shows the consumption pattern.

How to make a frequency distribution table using Excel:

Step 1: Open the Excel worksheet unit2_workbook_ex1.
Step 2: Select cell I5.
Step 3: Enter
=COUNTIF($B$5:$D$21, H5)
Step 4: Copy cell I5 to cells I6 to I9.
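
As a cross-check on the COUNTIF approach, the counting step can be sketched in Python. This is an illustrative sketch only: the purchase list is rebuilt here from the counts in Table 2.2 (an assumption made for the example, since the workbook itself is not reproduced), and collections.Counter then does what COUNTIF does for each drink.

```python
from collections import Counter

# Counts taken from Table 2.2; the raw list is rebuilt from them for illustration
table_2_2 = {
    "Coke Classic": 9,
    "Pepsi": 8,
    "Diet Coke": 13,
    "Sprite": 9,
    "Dr. Pepper": 11,
}
purchases = [drink for drink, count in table_2_2.items() for _ in range(count)]

# Counter tallies how many times each drink appears, like one COUNTIF per drink
frequency = Counter(purchases)

for drink, count in frequency.items():
    print(f"{drink}: {count}")
print("Total:", sum(frequency.values()))
```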


2.2.2 RELATIVE FREQUENCY AND PERCENT FREQUENCY DISTRIBUTION

A relative frequency is the fraction or proportion of items belonging to a class. For a data
set with a total of n observations, the relative frequency is calculated as:

$$\text{Relative frequency of the class} = \frac{\text{Frequency of the class}}{n}$$

The relative frequency represents the proportion of items in a particular class. The
percent frequency, that is, the share out of 100 in that class, is given by:

$$\text{Percent frequency} = \text{Relative frequency} \times 100$$

Now, let us try to calculate relative frequency and the percent frequency of the frequency
distribution of Table 2.2.

We must divide the frequency of each class by the total frequency to get the relative frequency.
The table in which a column contains the relative frequency is known as the
relative frequency distribution table:

Table 2.3: Relative Frequency Distribution

Sl. No. | Name of the drink | Frequency | Relative frequency
1 | Coke Classic | 9 | 9/50 = 0.18
2 | Pepsi | 8 | 8/50 = 0.16
3 | Diet Coke | 13 | 13/50 = 0.26
4 | Sprite | 9 | 9/50 = 0.18
5 | Dr. Pepper | 11 | 11/50 = 0.22
Total | | 50 |

Exercise 1: Develop the percent distribution table:

Table 2.4: Percent Distribution

Sl. No. | Name of the drink | Frequency | Relative frequency | Percent frequency
1 | Coke Classic | 9 | 0.18 | 18
2 | Pepsi | 8 | 0.16 | 16
3 | Diet Coke | 13 | 0.26 | 26
4 | Sprite | 9 | 0.18 | 18
5 | Dr. Pepper | 11 | 0.22 | 22
Total | | 50 | |
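
The two formulas of this section can be checked with a few lines of Python. The sketch below is illustrative only and starts from the frequencies in Table 2.2; the variable names are assumptions made for the example.

```python
# Frequencies from Table 2.2
frequency = {
    "Coke Classic": 9,
    "Pepsi": 8,
    "Diet Coke": 13,
    "Sprite": 9,
    "Dr. Pepper": 11,
}

n = sum(frequency.values())  # total number of observations

# Relative frequency = frequency of the class / n
relative = {drink: f / n for drink, f in frequency.items()}

# Percent frequency = relative frequency x 100
percent = {drink: 100 * f / n for drink, f in frequency.items()}

for drink in frequency:
    print(f"{drink}: {frequency[drink]}  {relative[drink]:.2f}  {percent[drink]:.0f}%")
```

The relative frequencies add up to 1 and the percent frequencies add up to 100, which is a quick way to check a hand-built table.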

2.3 ORGANIZING NUMERICAL VARIABLES

Quantitative variables are numerical. They represent a measurable quantity. For example,
when we speak of the population of a city, we are talking about the number of people in the
city - a measurable attribute of the city. Therefore, population would be a quantitative
variable.

2.3.1 DISCRETE VS. CONTINUOUS VARIABLES

Quantitative variables can be further classified as discrete or continuous. If a variable can
take on any value between two specified values, it is called a continuous variable; otherwise,
it is called a discrete variable.

Some examples will clarify the difference between discrete and continuous variables.

Suppose the fire department mandates that all fire fighters must weigh between 150 and
250 pounds. The weight of a fire fighter would be an example of a continuous variable, since
a fire fighter's weight could theoretically take on any value between 150 and 250 pounds,
even though in practice it may not take every such value.

Suppose we flip a coin n times and count the number of heads. The number of
heads could be any integer value between 0 and n. However, it cannot be a non-integer
value such as 2.5 heads. Therefore, the number of heads must be a discrete variable.

In simple words, data that come from measurement are called continuous
variables, and data that come from counting are called discrete variables.

2.3.2 FREQUENCY DISTRIBUTION FOR NUMERICAL VARIABLES


As defined in Section 2.2.1, a frequency distribution is a tabular summary of data showing the
number (frequency) of items in each of several non-overlapping classes. This is applicable to
both types of variables. Converting numerical raw data into tabular form with the
frequency of each class gives the frequency distribution.

Let us take a small case study.

Frooti, a very popular product of Parle Agro, received regular complaints on the CRM that the
Frooti 160 ml pack frequently contains less than 160 ml. As the Business Manager of the
company, we have to collect random samples from different retailers. Suppose we collected
20 samples of tetra pack Frooti from the market, measured the volume of each pack, and
noted them as follows:

Table 2.5: Samples of Tetra Pack

Frooti (160 ml pack) volume measured (in ml)

Sample size (n) = 20

156 | 156 | 156 | 156
164 | 167 | 160 | 158
158 | 165 | 162 | 164
161 | 163 | 161 | 146
160 | 159 | 163 | 170

This data is an example of a continuous variable because it came from measurement.


Refer to the Excel worksheet unit2_workbook_ex2.

The three steps necessary to define the classes for a frequency distribution with
quantitative data are:

1. Determine the number of non-overlapping classes.
2. Determine the width of each class.
3. Determine the class limits.

Let us try to develop the frequency distribution table for the data in Table 2.5.

Number of classes: As a general guideline, we recommend using between 5 and 20 classes.
For a small data set of up to 100 items, use 5 to 6 classes; for a large data set of
more than 100 items, use 10 to 20 classes. As the size of our data set is 20,
we created 5 classes.

Class width: The second step in constructing the frequency distribution table is to decide the
class width in the following manner:

$$\text{Approximate class width} = \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$$

In our case, we used the MAX function of Excel to find the largest data value and the MIN
function of Excel to find the smallest data value; the number of classes is already decided
as 5.

Largest data value = 170
Smallest data value = 146

$$\text{Class width} = \frac{170 - 146}{5} = 4.8 \approx 5$$

Class limits: To develop the table, we have to incorporate the smallest value in the first class,
so we start with 146 and add 4 to get 150 (since the class is inclusive of both numbers).
So, our first class is 146-150, the second class is 151-155, and so on.

The class limits must be chosen so that each data item belongs to one and only one class. In
the case of qualitative data, there is no chance of overlapping; therefore, no decision on
class limits is required for categorical data.
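
The three steps can be traced in a short Python sketch using the 20 volumes of Table 2.5. It is an illustration only: rounding the approximate width up with math.ceil is the choice made here to arrive at the width of 5 used in the text, and the variable names are assumptions.

```python
import math

# The 20 measured volumes (in ml) from Table 2.5
volumes = [156, 156, 156, 156, 164, 167, 160, 158, 158, 165,
           162, 164, 161, 163, 161, 146, 160, 159, 163, 170]

# Step 1: number of classes (5, as chosen in the text for a small data set)
num_classes = 5

# Step 2: approximate class width = (largest - smallest) / number of classes
approx_width = (max(volumes) - min(volumes)) / num_classes
class_width = math.ceil(approx_width)  # round up to a convenient whole number

# Step 3: inclusive class limits, starting from the smallest value
lowest = min(volumes)
classes = [(lowest + i * class_width, lowest + (i + 1) * class_width - 1)
           for i in range(num_classes)]

print("Approximate width:", approx_width)
print("Class width used:", class_width)
print("Classes:", classes)
```

Each class covers five whole-number values, so no volume can fall into two classes.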

Now, the frequency distribution is as follows. Each class is labelled by its boundaries
(for example, 146-151), and the Upper Limit column gives the largest value included in
that class.

Table 2.6: Frequency Distribution of Frooti Tetra Pack Data

Class | Upper Limit | Frequency
146-151 | 150 | 1
151-156 | 155 | 0
156-161 | 160 | 9
161-166 | 165 | 8
166-171 | 170 | 2
Total | | 20

Steps for developing a frequency distribution using Excel:


Step 1: Open the Excel worksheet unit2_workbook_ex2.

Step 2: Select Cells L8:L12


Step 3: Type, but do not enter, the following formula:

=FREQUENCY(B7:E11, K8:K12)

Step 4: Press CTRL+SHIFT+ENTER, and the array formula will be entered into each of the cells
L8:L12.

Table 2.6 provides the frequency distribution. The most frequently occurring class is
156-160 ml, which lies at or below the labelled 160 ml: 9 of the 20 tetra pack volumes fall
in it. One can draw many conclusions from the above frequency distribution.
The relative frequency and the percent frequency of each class can also be calculated.
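
The counting done by the FREQUENCY array formula can be imitated in Python. The sketch below is only an illustration under stated assumptions: the upper limits 150 to 170 stand in for the bin cells K8:K12, and bisect_left is used so that a value equal to an upper limit is counted in that class, which is how the classes are defined in this unit.

```python
from bisect import bisect_left

# The 20 measured volumes (in ml) from Table 2.5
volumes = [156, 156, 156, 156, 164, 167, 160, 158, 158, 165,
           162, 164, 161, 163, 161, 146, 160, 159, 163, 170]

# Upper limit of each class, as in Table 2.6
upper_limits = [150, 155, 160, 165, 170]

frequency = [0] * len(upper_limits)
for v in volumes:
    # index of the first class whose upper limit is >= v
    # (this sketch assumes no value exceeds the last upper limit)
    frequency[bisect_left(upper_limits, v)] += 1

for limit, count in zip(upper_limits, frequency):
    print(f"up to {limit}: {count}")
```

The counts reproduce the Frequency column of Table 2.6.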
Exhibit 1: Excel worksheet (unit2_workbook_ex2) showing the FREQUENCY array formula and the
frequency distribution of the Frooti tetra pack data.


Exercise:

Calculate the relative frequency and percent frequency of the data set in the table below:

Table 2.7: Frequency Distribution of Frooti Tetra Pack Data

Class | Upper Limit | Frequency | Relative Frequency | Percent Frequency
146-151 | 150 | 1 | |
151-156 | 155 | 0 | |
156-161 | 160 | 9 | |
161-166 | 165 | 8 | |
166-171 | 170 | 2 | |


2.4 VISUALIZATION OF CATEGORICAL DATA


Till now, we have learnt how to organize categorical (qualitative) data in tabular form.
The tabular form of representation is mostly useful for operational
decisions. Visualization, that is, representing the data in the form of graphs and pictures, is
used at the managerial level because graphs are more user-friendly and provide a quick
understanding of the data trends.

In the case of categorical data, the following are used to visualize:

1. Bar Graphs
2. Pie Charts
3. Pareto Chart

Bar Graphs and Pie Charts:

A bar graph is a graphical device for summarizing a frequency distribution in the form of
either horizontal or vertical bars. We portray the different classes along the X-axis
and the frequency along the Y-axis. For easy reference, the bar chart of the soft drink data
in Table 2.2 is given in Fig. 2.1.
Fig. 2.1: Bar Chart for the Categorical Data (bar chart of soft drink consumption; X-axis:
name of the drink; Y-axis: frequency)

How to make the chart using Excel is explained in unit2_workbook_ex1.


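The pie chart is generally credited to William Playfair, who published one in 1801. Florence
Nightingale, the famous British nurse, later popularized a related circular chart, the polar
area diagram.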
In most business problems, the most frequently used chart is the pie chart, based on the
principle of converting the frequencies into proportional angles at the centre of a
circle. The formula used for converting the frequency to the angle is

$$\text{Angle} = \frac{\text{Frequency of the class}}{n} \times 360$$

For example, in Table 2.2, the angle for Pepsi is calculated as (8/50) × 360 = 57.6
degrees.
The same data represented as a pie chart is illustrated in the Excel worksheet
unit2_workbook_ex1 and shown as Fig. 2.2.
Fig. 2.2: Pie Chart (soft drink purchases; for example, the Diet Coke sector is 26%)

The pie chart gives us more interesting results, since we can easily form an idea of the
proportions just by looking at the chart, e.g., 26% for Diet Coke. To draw a pie chart, we first
draw a circle to represent all the data. Then we use the relative frequencies to subdivide the
circle into sectors, or portions, that correspond to the relative frequency of each class. To do
so, we convert each relative frequency into its corresponding angle by using the formula

$$\text{Angle} = \text{Relative frequency} \times 360$$

Using this angle, we plot the sector on the circle. Similar calculations for each class of
the table yield the pie chart.
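
The angle calculation can be repeated for every class with a small Python sketch. It is illustrative only and uses the frequencies of Table 2.2; the variable names are assumptions made for the example.

```python
# Frequencies from Table 2.2
frequency = {
    "Coke Classic": 9,
    "Pepsi": 8,
    "Diet Coke": 13,
    "Sprite": 9,
    "Dr. Pepper": 11,
}

n = sum(frequency.values())

# Angle of each sector = relative frequency x 360
angles = {drink: f / n * 360 for drink, f in frequency.items()}

for drink, angle in angles.items():
    print(f"{drink}: {angle:.1f} degrees")
```

Since the relative frequencies add up to 1, the sector angles add up to 360 degrees and together fill the whole circle.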

Pareto Chart

The frequency of each category is plotted as a vertical bar in the descending order of
frequencies and combined with a cumulative percentage line on the same chart as shown in
the Fig. 2.3.

Fig. 2.3: Pareto Chart (categories in descending order of frequency: Diet Coke, Dr. Pepper,
Coke Classic, Sprite, Pepsi)


Steps for visualizing a Pareto chart in Excel:

Step 1: Select the categorical data set.
Step 2: Click Insert, go to the Charts group, click Insert Statistic Chart, and from the
Histogram dropdown box choose “Pareto”.
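
The numbers behind a Pareto chart, namely the categories sorted in descending order and the cumulative percentage line, can be sketched in Python as below. This is an illustration built on the frequencies of Table 2.2, not a reproduction of what Excel does internally.

```python
from itertools import accumulate

# Frequencies from Table 2.2
frequency = {
    "Coke Classic": 9,
    "Pepsi": 8,
    "Diet Coke": 13,
    "Sprite": 9,
    "Dr. Pepper": 11,
}

# Sort the categories in descending order of frequency (ties keep table order)
ordered = sorted(frequency.items(), key=lambda item: item[1], reverse=True)
names = [name for name, _ in ordered]
counts = [count for _, count in ordered]

# Cumulative percentage line plotted on top of the bars
total = sum(counts)
cumulative_percent = [100 * running / total for running in accumulate(counts)]

for name, count, cum in zip(names, counts, cumulative_percent):
    print(f"{name}: {count}  cumulative {cum:.0f}%")
```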

2.5 VISUALIZING NUMERICAL VARIABLES


The simplest way to visualize numerical (quantitative) variables is to
present the data in the form of a graph. The following types of graphs are most suitable:

1. Dot Plot/Scatter Plot

2. Histogram
3. Cumulative Distributions/Ogive

Dot Plot/Scatter Plot


One of the simplest graphical summaries of data is a dot plot or scatter diagram. In the
scatter diagram, we take the midpoint of the class on the X-axis and the frequency along the
Y-axis. The scatter plot gives a visual picture of the trend of the data. The midpoint of a
class is calculated as the sum of the upper limit and the lower limit of the class divided
by two.

We use the Frooti case (unit2_workbook_ex2) for developing the dot plot and scatter
diagram. In Excel, the following steps must be followed:

Step 1: Select the cells.
Step 2: Click the Insert tab on the Ribbon.
Step 3: In the chart group, click Scatter.
Step 4: When the list of scatter diagram subtypes appears, pick one as per your choice.
Fig. 2.4: Scatter Plot

Histogram
A histogram is a commonly used graphical representation of quantitative/numerical data. A
histogram is constructed by placing the variable of interest on the horizontal axis and the
frequency, relative frequency, or percent frequency on the vertical axis, as shown in
Fig. 2.5.

The most important use of a histogram is to provide information about the shape of the data.
If the histogram is tilted toward the left, we call it left-skewed; if it is tilted toward the
right, we call it right-skewed. The ideal shape is a symmetrical histogram, resembling a normal
curve. A histogram gives us a picture of the data. It also helps in deciding whether the
distribution is normal or not.
Fig. 2.5: Histogram (X-axis: classes 146-151, 151-156, 156-161, 161-166, 166-171; Y-axis:
frequency)

A histogram for the Frooti data can be made using Excel as follows:
Step 1: Select the class data set.
Step 2: Press the CTRL key and select the frequency data set.
Step 3: Click the Insert tab on the Ribbon.

Step 4: In the chart group, click Column/2D column, then chart layout 8.
Step 5: Select the bars, right-click, and choose Format Data Series.

Step 6: On the Format Data Series pane, set the Gap Width to zero.
The Cumulative Percentage Polygon

This visual presentation is also known as an ogive. It is just like a scatter diagram in which
the midpoint of the class is on the X-axis and the percent cumulative frequency is along the
Y-axis. Cumulative frequencies are of two types: less-than cumulative frequency and
more-than cumulative frequency.

Let us try to understand with the table given below:

Table 2.8: Cumulative Frequency

Class | Mid-point (x) | Frequency | Less than CF | More than CF
146-151 | 148.5 | 1 | 1 | 20
151-156 | 153.5 | 0 | 1 | 19
156-161 | 158.5 | 9 | 10 | 19
161-166 | 163.5 | 8 | 18 | 10
166-171 | 168.5 | 2 | 20 | 2

In the table above, we can notice at a glance how many packs have a volume of less than
151 ml: it is 1. In the next row, we find how many packs have a volume of less than 156 ml:
it is 1 again. Similarly, the number having a volume of less than 161 ml is 10, and so on. The
last column contains the number of packs with a volume of 146 ml or more, and so on.

We can plot the CF (cumulative frequency) distribution chart, or even the percent cumulative
distribution, as shown in Fig. 2.6.
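
Both kinds of cumulative frequency can be built from the Frequency column with a short Python sketch. It is illustrative only; the class frequencies are those of Table 2.6 and the variable names are assumptions.

```python
from itertools import accumulate

# Class frequencies from Table 2.6, in class order
frequency = [1, 0, 9, 8, 2]
total = sum(frequency)

# Less-than CF: running total from the first class downwards
less_than_cf = list(accumulate(frequency))

# More-than CF: number of items in this class and all later classes
more_than_cf = [sum(frequency[i:]) for i in range(len(frequency))]

# Percent cumulative frequency, the quantity plotted on the ogive
percent_cf = [100 * cf / total for cf in less_than_cf]

print("Less than CF:", less_than_cf)
print("More than CF:", more_than_cf)
print("Percent CF:", percent_cf)
```

The last less-than value and the first more-than value both equal the sample size of 20, which is a useful check.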

Fig. 2.6: CF Distribution

2.6 VISUALIZING TWO NUMERICAL VARIABLES


The best way to visualize two numerical variables together is through a scatter plot. A
scatter plot explores the relationship between two numerical variables. It also provides
insight into the correlation between the two variables.

Let us try to understand with the help of the following example:

Suppose that we want to introduce alternative irrigation equipment in Rajasthan. For that
purpose, we collected secondary data on yearly rainfall from the meteorological
department database and annual paddy production data from the agriculture ministry site:

Ex-3

Year | Rainfall (in MM) | Paddy Production (in MT)
1 | 123 | 678
2 | 134 | 654
3 | 156 | 698
4 | 167 | 643
5 | 134 | 690

Fig. 2.7: Scatter Plot of Rainfall Vs Paddy Production (X-axis: rainfall in MM; Y-axis: paddy
production in MT)


Kindly refer to Unit2_workbook_ex3. We took the rainfall data along the X-axis and the paddy
production data along the Y-axis; Fig. 2.7 is the scatter plot of the two variables.
The points show no clear linear pattern, so the correlation between rainfall and paddy
production appears weak at best.
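
The visual impression can be backed by a number. The Python sketch below computes the sample correlation coefficient for the five years of Ex-3; it is offered only as an illustration, the formal treatment of correlation comes in Unit 3, and the helper name correlation is an assumption made here.

```python
import math

# Data from Ex-3
rainfall = [123, 134, 156, 167, 134]   # in MM
paddy = [678, 654, 698, 643, 690]      # in MT


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = correlation(rainfall, paddy)
print(f"r = {r:.2f}")
```

A value of r close to 0 indicates little linear relationship; with only five observations, any such figure should be read with caution.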


2.7. ORGANIZING AND VISUALIZING A MIX OF VARIABLES


We now introduce some of the newer methods that enable us to work with
interactive tabular and visual summaries of a mix of variables. A contingency table is a way
of organizing such data, and a side-by-side bar chart is a method of visualizing it.

In Excel, we use a PivotTable for this purpose. A detailed tutorial is attached as the endnote.

2.8 THE CHALLENGE IN ORGANIZING AND VISUALIZING VARIABLES

Creating false impressions is a major challenge in organizing and visualizing data. As we
organize and visualize variables, we must be careful not to create false impressions that
could affect preliminary conclusions about the data. Improper summarization and
visualization create false impressions.

Information overload, that is, presenting too many details, can hamper our decision-making.
The overuse of data confuses decision-makers and is known as obscuring data.

Some people add decorative elements that enhance or replace the simplicity of charts and
graphs; these can further create a false impression and are known as chart junk.

Best Practices for Visualization:

• Use the simplest possible graph/chart.
• Provide a title and labels.
• Use the axis titles properly.
• Avoid using uncommon charts like radar, bubble, pyramid, etc.

2.9 LET US SUM UP

We sum up the discussion with the help of a flow diagram (which is also a type of
visualization tool):

Data
    Categorical Data
        Tabular Methods (Frequency Distribution, Cumulative Frequency Distribution)
        Graphical Methods
    Numerical Data
        Tabular Methods (Frequency Distribution, Cumulative Frequency Distribution)
        Graphical Methods

This chapter introduces the role of statistics in turning data into information. Businesses use
statistics to summarize and draw conclusions from data, to make reliable forecasts, and to
improve business processes. The chapter discusses data collection and the various types of
data used in business. We also learned how to draw tables and charts that are appropriate
for categorical and numerical variables and to draw conclusions from them. Pie charts,
histograms, and other graphical methods that enable decision-making were discussed.

2.10 KEYWORDS

Frequency distribution: A tabular summary of data showing the number of data values in
each of several non-overlapping classes.

Cumulative frequency distribution: A tabular summary of numerical data showing the
number of data values that are less than or equal to the upper class limit of each class.

Some other key words in this unit are Bar Chart, Pie Chart, and Relative Frequency.

2.11 SELF-ASSESSMENT QUESTIONS

Q.1 In a survey, 150 executives were asked what they think is the most common mistake
candidates make during job interviews. Six different mistakes were provided. Which of
the following is the best for presenting the information?

a) A bar chart
b) A histogram
c) A stem-and-leaf display
d) A contingency table

Q.2 You have collected information on the market share of 5 different search engines used
by Indian Internet users in a particular quarter. Which of the following is the best for
presenting the information?

a) A pie chart
b) A histogram
c) A stem-and-leaf display
d) A contingency table

Q.3 You have collected information on the consumption by the 15 largest coffee-
consuming nations. Which of the following is the best for presenting the shares of the
consumption?

a) A pie chart
b) A Pareto chart
c) A side-by-side bar chart
d) A contingency table

NOTE: Even though a pie chart can also be used, the Pareto chart is preferable for separating
the “vital few” from the “trivial many”.

Q.4 You have collected data on the approximate retail price (in $) and the energy cost per
year (in $) of 15 refrigerators. Which of the following is the best for presenting the
data?

a) A pie chart
b) A scatter plot
c) A side-by-side bar chart
d) A contingency table

Q.5 You have collected data on the number of Indian households actively using online
banking and/or online bill payment over a 10-year period. Which of the following is
the best for presenting the data?

a) A pie chart
b) A stem-and-leaf display
c) A side-by-side bar chart
d) A time-series plot

Q.6 You have collected data on the monthly seasonally adjusted civilian unemployment
rate for India over a 10-year period. Which of the following is the best for
presenting the data?

a) A contingency table
b) A stem-and-leaf display
c) A time-series plot
d) A side-by-side bar chart

Q.7 You have collected data on the number of complaints for 6 different brands of
automobiles sold in India over a 10-year period. Which of the following is the best
for presenting the data?

a) A contingency table
b) A stem-and-leaf display
c) A time-series plot
d) A side-by-side bar chart

Q.8 You have collected data on the responses to two questions asked in a survey of 40
college students majoring in business: What is your gender (Male = M; Female = F)
and What is your major (Accountancy = A; Computer Information Systems = C;
Marketing = M). Which of the following is the best for presenting the data?

a) A contingency table
b) A stem-and-leaf display
c) A time-series plot
d) A Pareto chart


Q.9 Categorical data can be graphically represented by using:

a) histogram
b) frequency polygon
c) ogive
d) bar chart

Q.10 The sum of the percent frequencies for all classes will always equal

a) One
b) The number of classes
c) The number of items in the study
d) 100

CHECK YOUR PROGRESS

SCENARIO 1

A sample of 200 students at XYX University was taken after the mid-term to ask them
whether they went bar hopping the weekend before the mid-term or spent the weekend
studying, and whether they did well or poorly on the mid-term. The following table contains
the result.

 | Did Well in Mid-term | Did Poorly in Mid-term
Studying for Exam | 80 | 20
Went Bar Hopping | 30 | 70

Q.1 Referring to Scenario 1, of those who went bar hopping the weekend before the mid-
terminthe sample, percent of them did well on the mid-term.
a) 15
b) 27.27
c) 30
d) 55
Q.2 Referring to Scenario 1, of those who did well on the mid-term in the sample,
percent of them went bar hopping the weekend before the mid-term.
a) 15
b) 27.27
c) 30
d) 50
Q.3 Referring to Scenario 1, percent of the students in the sample went bar
hopping the weekend before the mid-term and did well on the mid-term.
a) 15
OPMC001
Business Statistics

b) 27.27

c) 30
d) 50
Q.4 Referring to Scenario 1, percent of the students in the sample spent the
weekend studying and did well on the mid-term.
a} 40
b) 50
c) 72.72

d) 80

Q.5 Referring to Scenario 1, if the sample is a good representation of the population, we


can expect percent of the students in the population to spend the weekend
studying and do poorly onthe mid-term.
a} 10
b) 20
c) 45
d) 50
Q.6 Referring to Scenario 1, if the sample is a good representation of the population, we
can expect percent of those who spent the weekend studying to do poorly on
the mid-term.
a) 10
b) 20
c) 45
d) 50
Q.7 Referring to Scenario 1, if the sample is a good representation of the population, we
can expect percent of those who did poorly on the mid-term to have spent
the weekend studying.
a) 10
b) 22.22
c) 45
d) 50
Q.8 In a contingency table, the number of rows and columns
a) must always be the same
b) must always be 2
c) must add to 100%
d) None of the above
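
The questions on Scenario 1 all rest on three kinds of percentage that can be read from a contingency table: a percentage of the overall total, of a row total, or of a column total. The sketch below illustrates the three calculations in Python; the dictionary layout and the function names are assumptions made for illustration and are not part of the course workbook.

```python
# Counts from the Scenario 1 contingency table (rows: weekend activity,
# columns: mid-term result). The layout is an illustrative assumption.
table = {
    "Studying for Exam": {"Did Well": 80, "Did Poorly": 20},
    "Went Bar Hopping": {"Did Well": 30, "Did Poorly": 70},
}

# Overall total, row totals and column totals.
total = sum(sum(row.values()) for row in table.values())
row_totals = {name: sum(row.values()) for name, row in table.items()}
col_totals = {}
for row in table.values():
    for col, count in row.items():
        col_totals[col] = col_totals.get(col, 0) + count


def pct_of_total(row, col):
    """Percentage of all students who fall in the given cell."""
    return 100 * table[row][col] / total


def pct_of_row(row, col):
    """Percentage of the students in a row who fall in the given column."""
    return 100 * table[row][col] / row_totals[row]


def pct_of_column(row, col):
    """Percentage of the students in a column who fall in the given row."""
    return 100 * table[row][col] / col_totals[col]


print(f"Of all students: {pct_of_total('Went Bar Hopping', 'Did Well'):.2f}%")
print(f"Of bar hoppers:  {pct_of_row('Went Bar Hopping', 'Did Well'):.2f}%")
print(f"Of 'did well':   {pct_of_column('Went Bar Hopping', 'Did Well'):.2f}%")
```

The wording of each question decides the denominator: "of those who ..." points to a row or column total, while "of the students in the sample" points to the overall total.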


2.12 ANSWERS TO SELF-ASSESSMENT QUESTIONS


Q.1
Q.2
Q.3
Q.4
Q.5
Q.6
Q.7
Q.8
Q.9
Q.10

2.13 CHECK YOUR PROGRESS- POSSIBLE ANSWERS

Q.1 c
Q.2 b
Q.3 a
Q.4 a
Q.5 a
Q.6 b
Q.7 b
Q.8 d


NUMERICAL DESCRIPTIVE MEASURES

STRUCTURE
3.0 Objectives
3.1 Introduction
3.2 Central Tendency
3.3 Variation and Shape
3.4 Exploring Numerical Data
3.5 The Covariance and the Coefficient of Correlation
3.6 Let Us Sum Up
3.7 Key Words
3.8 Self-Assessment Questions
3.9 Check Your Progress - Possible Answers
3.10 Answers to Self-Assessment Questions

3.0 OBJECTIVES

After reading this unit, you will be able to:


learn about types of statistics
use the measures of location (descriptive statistics)

understand the measures of variability


understand the Covariance and the Coefficient of Correlation

use Excel for Descriptive Statistics

3.1 INTRODUCTION

In Unit 2 we discussed tabular and graphical presentations used to summarize data. In this unit, we discuss numerical measures that provide additional ways of summarizing data: measures of location, dispersion, shape, and association. If the measures are computed for data from a sample, they are called sample statistics. If the measures are computed for data from a population, they are called population parameters. In statistical inference, a sample statistic is referred to as the point estimator of the corresponding parameter.

3.2 CENTRAL TENDENCY

Let us consider the simple case of the MCD company, where Mr. Ram Kumar is the Chief Administrative Officer. He has provided the following data on employee salaries: the mean salary is INR 10,000, and the median salary is INR 5,000.

His queries are as follows:

a) Is this company doing business normally?

b) Is there any possibility of a strike or lockout in this company?

c) Are most employees demotivated? If yes, provide the reason.


Now, we turn to data summarization. Summaries of data, which may be tabular, graphical, or numerical, are referred to as descriptive statistics.
Many cases require data about a large group of elements. But due to time, cost and other considerations, data can often be collected only from a small portion of the group.

The set of all elements of interest in a study is called the population.

In our MCD case, the population is all the employees of MCD. The size of the population is represented by the symbol N. A sample is a small subset of the population, and its size is represented by the symbol n.
The process of collecting data from the entire population is called a 'census'. The process of collecting data from a small subset of the population (a sample) is called a 'survey'.
The most important measure of the central tendency of the data is the 'mean', or average value, of a variable. The mean is of two types, depending on the data used. If we have population data, then the mean is called the population mean. The population mean is denoted by the Greek symbol $\mu$ (read as mu), and the mean of the sample data is represented by the symbol $\bar{x}$.

The statistical formula for calculating the sample mean is:

$$\bar{x} = \frac{\sum x}{n}$$

and for the population mean:

$$\mu = \frac{\sum x}{N}$$

The Greek letter $\Sigma$ is the summation sign.

In the MCD case, the mean ($\mu$) of the salary of all the employees is Rs 10,000, which implies that we have added the salaries of all the employees of the organization and divided by the total number of employees of the organization.


Now, let us consider a 10-employee salary sheet (refer to Excel sheet unit3_workbook_worksheet1).

Emp. no. | Salary (INR)
1 | 3,456
2 | 5,689
3 | 6,784
4 | 7,890
5 | 7,590
6 | 8,945
7 | 10,345
8 | 7,698
9 | 7,589
10 | 9,876

Here the sample size is 10, and we calculate the sample mean by adding all the salaries and dividing by 10.
This can also be calculated by using the Excel function

=AVERAGE(D4:D13).
Alternatively, perform the following steps:

Step 1: Take the sum of all the salaries by using the function =SUM().
Step 2: Divide the sum by n = 10.
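
For readers who want to check the two steps outside Excel, the following is a minimal Python sketch of the same calculation. The salary figures are those of the 10-employee sheet; the function name `sample_mean` is an illustrative assumption.

```python
# Salaries (INR) of the 10 employees in the salary sheet.
salaries = [3456, 5689, 6784, 7890, 7590, 8945, 10345, 7698, 7589, 9876]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


total = sum(salaries)                 # Step 1: add all the salaries
mean_salary = sample_mean(salaries)   # Step 2: divide the sum by n = 10

print(f"Sum = {total}, n = {len(salaries)}, mean = {mean_salary:.2f}")
```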

The calculation of the mean is simple. The arithmetic mean is the measure least affected by fluctuations of sampling. However, in extremely skewed distributions, the arithmetic mean is not a suitable measure for statistical analysis.

WEIGHTED MEAN
The weighted mean is calculated by multiplying each value by its weight (or frequency), adding the products together, and dividing the total by the sum of the weights (or frequencies).

For example, suppose the sales data for shoes in a store is as follows:

SHOES DATA
Shoe Size | Cost of each pair| No. of shoes
4 300 20
5 235 34
6 346 23
7 450 45
8 569 56
9 650 65

=SUMPRODUCT(N7:N12, M7:M12)/SUM(N7:N12)



In the above example we calculated the mean by multiplying the weight, i.e. the number of shoes of each size, by the respective cost per pair, adding the products, and then dividing the total sum of the products by the total number of shoes in the store.
The mean cost is Rs 478.65.
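
The same weighted mean can be sketched in Python as a cross-check on the Excel formula. The data are the shoe figures from the table; the tuple layout and the name `weighted_mean` are assumptions for illustration.

```python
# Shoe data: (shoe size, cost of each pair in Rs, number of shoes).
shoes = [
    (4, 300, 20),
    (5, 235, 34),
    (6, 346, 23),
    (7, 450, 45),
    (8, 569, 56),
    (9, 650, 65),
]


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


costs = [cost for _, cost, _ in shoes]
counts = [count for _, _, count in shoes]
mean_cost = weighted_mean(costs, counts)

print(f"Total pairs = {sum(counts)}, weighted mean cost = Rs {mean_cost:.2f}")
```

Here the number of shoes plays the role of the weight, exactly as the second range does inside SUMPRODUCT.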

For understanding the 'Geometric Mean' and 'Harmonic Mean', you can browse the site www.mathsisfun.com

MEDIAN

The median is also a measure of central location. The median is the middle value in the data set when the values are arranged in order.

For example, if we have a salary data set of 100 employees and the median value is Rs 5,000, it means that there are 50 employees in this organization with a salary of more than Rs 5,000, and 50 employees with a salary of less than Rs 5,000.

To calculate the median, the following steps are to be followed:

1. Arrange the data in ascending order. For an odd number of observations, the median is the middle value.
2. For an even number of observations, the median is the average of the two middle values.

Let us now try the exercise in Excel (Unit3_workbook_worksheet2).

45
67
234
234
345
345
432
456
567
567
567
675
678
789
876
876
987
6789
We have a total of 18 data items in the above table. We used the Excel Sort function to arrange them in ascending order. Dividing 18 by 2, we obtain the integer 9. So, the median is the average of the values at the 9th and 10th positions from the top, which in this case is 567. If the total number of elements had been 19, the mid-point would have been 9.5, and we would have considered the value at the 10th position.
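
The median rule for odd and even numbers of observations can be sketched in a few lines of Python. The data are the 18 values listed above; the function name `median` is an illustrative choice.

```python
# The 18 observations from the worksheet.
data = [45, 67, 234, 234, 345, 345, 432, 456, 567,
        567, 567, 675, 678, 789, 876, 876, 987, 6789]


def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: 9th and 10th values when n = 18


print(f"n = {len(data)}, median = {median(data)}")
```

Note that the single very large value, 6789, has no effect on the median, because only the middle positions matter.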

We also refer to the quartiles: the first quartile, or 25th percentile (Q1); the second quartile, or 50th percentile (Q2); and the third quartile, or 75th percentile (Q3).

The formula for finding the position of a given percentile in the ordered data is

$$i = \left(\frac{p}{100}\right) n$$

where p is the percentile and n is the sample size.

For the 75th percentile of the above data set:

$$i = \left(\frac{75}{100}\right) \times 18 = 13.5$$

Since 13.5 is not an integer, we round up and take the value at the 14th position.
The 75th percentile, or Q3, is therefore 789.
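
The position rule can be sketched in Python as follows. The text only shows the case where the position is not an integer (round up). For an integer position, the sketch averages the values at positions i and i + 1, which is a common textbook convention and is an assumption here. The function name `percentile` is also illustrative.

```python
# The 18 observations from the median exercise.
data = [45, 67, 234, 234, 345, 345, 432, 456, 567,
        567, 567, 675, 678, 789, 876, 876, 987, 6789]


def percentile(values, p):
    """Return the p-th percentile (0 < p < 100) using the position i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    if (p * n) % 100 != 0:
        # Position is not an integer: round up to the next position.
        position = (p * n) // 100 + 1
        return ordered[position - 1]
    # Position is an integer (assumed convention): average positions i and i + 1.
    position = (p * n) // 100
    return (ordered[position - 1] + ordered[position]) / 2


for p in (25, 50, 75):
    print(f"{p}th percentile = {percentile(data, p)}")
```

With this convention the 50th percentile coincides with the median computed earlier.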

In the case of a frequency distribution, the formula for calculating the median is:

$$\text{Median} = M_e = l + h \times \frac{\frac{n}{2} - cf}{f}$$

Here l = lower limit of the median class

h = length of the class interval

n = total frequency

cf = cumulative frequency of the class preceding the median class

f = frequency of the median class

The formula locates the median within the median class by interpolation.
Let us consider a frequency distribution.
Let us consider a frequency distribution
Table 3.1: Frequency Distribution of Shareholding month wise.

S. No. | No. of months of holding | No. of shareholders | cf
1 | Less than 2 | 4 | 4
2 | 2 to 4 | 7 | 11
3 | 4 to 6 | 10 | 21
4 | 6 to 8 | 12 | 33
5 | 8 to 10 | 14 | 47
6 | 10 to 12 | 6 | 53
7 | 12 to 14 | 15 | 68
8 | 14 to 16 | 13 | 81
9 | 16 to 18 | 1 | 82
Total | | 82 |
Now the total frequency in this data set is 82, and n/2 is 82/2 = 41.
The first cumulative frequency greater than 41 is 47, in the class 8 to 10, so our median class is 8 to 10.

Now l = 8, h = 2, n/2 = 41, f = 14, and cf = 33.
Then Median = 8 + 2 × ((41 - 33) ÷ 14) = 9.14 (refer to Unit3_Workbook_worksheet3).
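
The grouped-median calculation can be sketched in Python as below. The class limits and frequencies come from Table 3.1; treating the first class "Less than 2" as running from 0 to 2 is an assumption, as is the function name `grouped_median`.

```python
# (lower limit, upper limit, frequency) for each class of Table 3.1.
# The first class, "Less than 2", is assumed to run from 0 to 2.
classes = [
    (0, 2, 4), (2, 4, 7), (4, 6, 10), (6, 8, 12), (8, 10, 14),
    (10, 12, 6), (12, 14, 15), (14, 16, 13), (16, 18, 1),
]


def grouped_median(classes):
    """Median = l + h * ((n / 2 - cf) / f), located in the median class."""
    n = sum(f for _, _, f in classes)
    half = n / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        if cf + f >= half:
            # This is the median class: interpolate inside it.
            return lower + (upper - lower) * (half - cf) / f
        cf += f
    raise ValueError("no classes given")


print(f"Grouped median = {grouped_median(classes):.3f}")
```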

The median is often the most suitable measure of central tendency when the data set contains extreme values, because it is not affected by them.

MODE
The third measure of central tendency is the mode. For a frequency distribution, the value of the mode is calculated using the following formula:

$$\text{Mode} = l + h \times \frac{f_0 - f_1}{2 f_0 - f_1 - f_2}$$

Here l = lower limit of the modal class

h = class interval
$f_0$ = frequency of the modal class
$f_1$ = frequency of the class preceding the modal class

$f_2$ = frequency of the class succeeding the modal class

Let us try to understand the mode calculation with the help of an example.

Refer to Unit3_Workbook_worksheet3.
Table 3.2: Frequency Distribution of Shareholding month wise.

S. No. | No. of months of holding | No. of shareholders | cf
1 | Less than 2 | 4 | 4
2 | 2 to 4 | 7 | 11
3 | 4 to 6 | 10 | 21
4 | 6 to 8 | 12 | 33
5 | 8 to 10 | 14 | 47
6 | 10 to 12 | 6 | 53
7 | 12 to 14 | 15 | 68
8 | 14 to 16 | 13 | 81
9 | 16 to 18 | 1 | 82
Total | | 82 |

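Here the class with the maximum frequency is 12 to 14, so

l = 12, h = 2, frequency of the modal class = 15, frequency of the preceding class = 6, frequency of the succeeding class = 13
Now Mode = 12 + 2 × ((15 - 6) ÷ (2 × 15 - 6 - 13)) = 13.64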
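
A short Python sketch of the grouped-mode calculation follows. It uses the class data of the shareholding table, assumes the first class runs from 0 to 2, and treats a missing neighbouring class as having frequency 0. These choices and the name `grouped_mode` are illustrative assumptions.

```python
# (lower limit, upper limit, frequency) for each class of the shareholding table.
# The first class, "Less than 2", is assumed to run from 0 to 2.
classes = [
    (0, 2, 4), (2, 4, 7), (4, 6, 10), (6, 8, 12), (8, 10, 14),
    (10, 12, 6), (12, 14, 15), (14, 16, 13), (16, 18, 1),
]


def grouped_mode(classes):
    """Mode = l + h * (f0 - f1) / (2 * f0 - f1 - f2) for the modal class."""
    # Index of the class with the highest frequency (the modal class).
    idx = max(range(len(classes)), key=lambda i: classes[i][2])
    lower, upper, f0 = classes[idx]
    # Frequencies of the neighbouring classes (0 if there is no neighbour).
    f1 = classes[idx - 1][2] if idx > 0 else 0
    f2 = classes[idx + 1][2] if idx < len(classes) - 1 else 0
    return lower + (upper - lower) * (f0 - f1) / (2 * f0 - f1 - f2)


print(f"Grouped mode = {grouped_mode(classes):.2f}")
```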

Mode is extensively used by the industries.


CHECK YOUR PROGRESS - I

Q.1 The growth factors of the population in Ranchi in the past two years have been 8 and 12. The geometric mean is

a) 20
b) Square root of 20

c) Square root of 96
d) 96

3.3 VARIATION AND SHAPE

Variation and shape are major characteristics of a dataset; they give an idea about the spread, or dispersion, of the values. The range is the simplest measure of variation: it gives the overall width of the dataset.

The Range

The range is represented by the symbol R. It is the difference between the largest and the smallest value of the data set.

Range (R) = largest value - smallest value
Let us take the same case of the MCD company. If the highest paid salary is Rs 1,00,000 and the lowest paid salary is Rs 20,000, then the range is

R = Rs 1,00,000 - Rs 20,000 = Rs 80,000.

In Excel, to find the largest value we use the =MAX() function, and to find the smallest value we use the =MIN() function. The total spread of the salary data of the MCD company is Rs 80,000.
Variance and Standard Deviation

The range only states the spread of the data set. We are also interested in the distribution of the data within the data set: we want to know how the data are scattered around the centre point. We can get this information only by subtracting the centre point from each data value. The difference between a data value and the centre point is called the variation (deviation); sometimes the variation is positive and sometimes it is negative.

For example, suppose the central tendency of the salary in the MCD case is Rs 5,000.
Suppose employee A's salary is Rs 3,000; then the variation is:

V = Rs 3,000 - Rs 5,000 = - Rs 2,000.

Now suppose another employee B has a salary of Rs 7,000; then the variation is:
V = Rs 7,000 - Rs 5,000 = Rs 2,000.
This illustration shows that the variation can be positive as well as negative. The total variation is the sum of all the variations in the data set. When the variations are measured from the mean,

$$TV = \sum (x_i - \bar{x}) = 0$$

because the positive and negative variations cancel out.

To eliminate the effect of the negative signs, we use a measure based on squared deviations.

Variance and standard deviation: These statistics measure the average deviation of the data around the mean; they give a picture of how the data fluctuate above and below the mean.
The process of calculating the sample variance ($s^2$) and the sample standard deviation ($s$) is:

Step 1: Find the difference between each value and the mean
Step 2: Square each difference.

Step 3: Sum the squared difference.


Step 4: Divide this total by n-1 to find the sample variance.

Step 5: Take the Square root of the sample variance to get the sample standard deviation.
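
The five steps translate directly into code. The sketch below applies them to the 10 salaries of the earlier salary sheet; the function names are illustrative assumptions.

```python
import math

# Salaries (INR) of the 10 employees from the earlier salary sheet.
salaries = [3456, 5689, 6784, 7890, 7590, 8945, 10345, 7698, 7589, 9876]


def sample_variance(values):
    """Sample variance following Steps 1 to 4."""
    n = len(values)
    mean = sum(values) / n
    differences = [x - mean for x in values]   # Step 1: value minus mean
    squared = [d ** 2 for d in differences]    # Step 2: square each difference
    total = sum(squared)                       # Step 3: sum the squared differences
    return total / (n - 1)                     # Step 4: divide by n - 1


def sample_std(values):
    """Step 5: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


print(f"Sample variance = {sample_variance(salaries):.2f}")
print(f"Sample standard deviation = {sample_std(salaries):.2f}")
```

The result should agree with Excel's sample functions, which also divide by n - 1.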
For finding the variance and standard deviation of a frequency distribution, the formula used is

$$s^2 = \frac{\sum f M^2 - n\bar{x}^2}{n - 1}$$

Here M is the midpoint of the class, $\bar{x}$ is the mean of the data set, and n is the total frequency.
The square root of the variance is the standard deviation.
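
As a hedged illustration, the grouped formula can be applied to the shareholding frequency distribution used earlier in this unit. The sketch assumes the first class "Less than 2" runs from 0 to 2 and estimates the mean from the class midpoints; the function name is hypothetical.

```python
import math

# (lower limit, upper limit, frequency) for each class of the shareholding table.
# The first class, "Less than 2", is assumed to run from 0 to 2.
classes = [
    (0, 2, 4), (2, 4, 7), (4, 6, 10), (6, 8, 12), (8, 10, 14),
    (10, 12, 6), (12, 14, 15), (14, 16, 13), (16, 18, 1),
]


def grouped_sample_variance(classes):
    """s^2 = (sum of f * M^2 - n * mean^2) / (n - 1), with M the class midpoint."""
    n = sum(f for _, _, f in classes)
    midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
    freqs = [f for _, _, f in classes]
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    sum_f_m2 = sum(f * m ** 2 for f, m in zip(freqs, midpoints))
    return (sum_f_m2 - n * mean ** 2) / (n - 1)


variance = grouped_sample_variance(classes)
print(f"Grouped variance = {variance:.3f}, standard deviation = {math.sqrt(variance):.3f}")
```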

Refer to the illustration in unit3_workbook_worksheet3.


The standard deviation of the population is represented by the symbol $\sigma$ (read as sigma). The standard deviation of the sample is represented by the symbol s.
The formula for the population standard deviation is:

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

The objective of converting the variance into the standard deviation is to measure the deviation of the data from the central tendency in the same units as the data.

3.4 EXPLORING NUMERICAL DATA

We can explore the numerical data by calculating the central tendency, variation and shape.
We can also visualize the distribution of numerical variables by calculating the quartiles, five-
number summary and constructing a box plot.

In the previous section, we learnt about central tendency and the method of calculating the quartiles.
Here we will first try to understand what a percentile is and what the interquartile range is; then we will see the five-number summary.
Percentile: Percentiles are related to the quartiles. Percentiles split the ordered data into 100 equal parts. By this definition, the first quartile is equivalent to the 25th percentile, the second quartile to the 50th percentile, and the third quartile to the 75th percentile.
The Interquartile Range
The difference between the third quartile and the first quartile is called the interquartile range.
Interquartile range = Q3 - Q1
The Five-number Summary

The five-number summary of a data set consists of the smallest value, the first quartile, the median, the third quartile, and the largest value.

Five-number summary:

$X_{smallest}$, Q1, Median, Q3, $X_{largest}$

Let us try to understand how the five-number summary helps in exploring numerical data.

Let us take the five-number summary of a data set as:

28, 36, 40, 44, 51

The distance from $X_{smallest}$ to the median (40 - 28 = 12) is slightly more than the distance from $X_{largest}$ to the median (51 - 40 = 11). Therefore, the distribution is slightly left-skewed.
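
The comparison of distances can be written as a small Python sketch. It uses the five-number summary 28, 36, 40, 44, 51 from the example; the function names and the simple three-way classification are illustrative assumptions, not a formal skewness test.

```python
def interquartile_range(q1, q3):
    """Interquartile range = Q3 - Q1."""
    return q3 - q1


def describe_skew(smallest, q1, median, q3, largest):
    """Compare the distance from each extreme to the median, as in the example."""
    left_distance = median - smallest
    right_distance = largest - median
    if left_distance > right_distance:
        return "left-skewed"
    if right_distance > left_distance:
        return "right-skewed"
    return "symmetric"


summary = (28, 36, 40, 44, 51)  # smallest, Q1, median, Q3, largest
print(f"IQR = {interquartile_range(summary[1], summary[3])}")
print(f"Shape: {describe_skew(*summary)}")
```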

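The Boxplot
To visualize the shape of the distribution from the five-number summary, a boxplot is used. A box is drawn from Q1 to Q3 with a line inside it at the median, and lines (whiskers) extend from the box to the smallest and largest values.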

3.5 THE COVARIANCE AND THE COEFFICIENT OF CORRELATION

In this section we will understand two measures of the relationship between two numerical variables:
a) The covariance

b) The coefficient of correlation
The Covariance
The covariance measures how two numerical variables (X and Y) vary together, that is, the direction of the linear relationship between them.
The formula for the sample covariance is

$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

The covariance does not express the relative strength of the relationship. It is the sum of the products of the individual variations, divided by n - 1, so its size depends on the units of the variables. We cannot judge from the resulting value whether the relationship is strong or weak, and so we use another measure called the coefficient of correlation.

The coefficient of correlation measures the relative strength of the linear relationship between two numerical variables. The sample coefficient of correlation is represented by the symbol 'r', which ranges from -1 for a perfect negative correlation to +1 for a perfect positive correlation. When dealing with a population, the coefficient of correlation is represented by the symbol $\rho$ (rho).
The sample coefficient of correlation (r) is calculated as:

$$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$$

that is, the covariance of the two variables divided by the product of their individual standard deviations,

where

$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

$$S_Y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$$

Let us look at a small case study.

The manager of a factory wants to predict how many extra conference kits to prepare on a day when a conference is held at the IMTCDL. A random sample of records of the last few years is as follows:

Table 3.2: Conference Data

No. of conference registrations | No. of extra kits


126 103
367 134
213 123
151 56
178 100
270 135
301 156
345 176
169 78
111 67
218 126
234 111
245 178
256 113
123 121
345 128

In this case, we must evaluate r first.

We can use the Excel function
=CORREL(C5:C20, D5:D20)
r = 0.69.
This indicates a moderately strong positive linear relationship between the number of registrations and the number of extra kits.
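
To see what CORREL does internally, the sketch below computes the sample covariance, the two sample standard deviations and r for the conference data, using the covariance divided by the product of the standard deviations. The variable and function names are illustrative assumptions.

```python
import math

# Conference data: registrations (X) and extra kits (Y), 16 records.
registrations = [126, 367, 213, 151, 178, 270, 301, 345,
                 169, 111, 218, 234, 245, 256, 123, 345]
extra_kits = [103, 134, 123, 56, 100, 135, 156, 176,
              78, 67, 126, 111, 178, 113, 121, 128]


def mean(values):
    return sum(values) / len(values)


def sample_covariance(xs, ys):
    """Sum of the products of the deviations, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)


def sample_std(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    v_bar = mean(values)
    return math.sqrt(sum((v - v_bar) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    """r = cov(X, Y) / (S_X * S_Y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


r = correlation(registrations, extra_kits)
print(f"cov = {sample_covariance(registrations, extra_kits):.2f}, r = {r:.2f}")
```

Unlike the covariance, r has no units, which is why it can be compared across data sets.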

Population Correlation

$$\rho = \frac{\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)}{\sqrt{\sum_{i=1}^{N} (x_i - \mu_X)^2 \sum_{i=1}^{N} (y_i - \mu_Y)^2}}$$

3.6 LET US SUM UP
In this unit, we introduced several basic elements of descriptive statistics, helpful in solving
business problems. We have also discussed the utilities of the central tendency, variance,
and correlations.


3.7 KEY WORDS


Mean: The average value of the data set.

Median: The middle value of the ordered data set.


Mode: The value that occurs with the greatest frequency.

3.8 SELF-ASSESSMENT QUESTIONS


Q.1 Consider a sample with data values 27, 25, 20, 15, 30, 34, 28, and 25. Compute the 20th, 25th, 65th and 75th percentiles.
Q.2 Compute the Geometric Mean and Harmonic mean of the following Dataset.
34, 67, 56, 13, 25, 65, 43, 27
(Use the reference material for understanding about the Harmonic mean and
Geometric mean)

Write True or False


Q.3 The value of the standard deviation may be either positive or negative, while the value
of the variance will always be positive.

Q.4 The difference between the largest and smallest observations in an ordered data set is called the range.

Q.5 The standard deviation is expressed in terms of the original units of measurement, but the variance is not.
Q.6 The data set 10, 20, 30 has the same variance as the data set 100, 200, 300.

Multiple Choice Questions


Q.1 Which of the following statistics is a measure of central location?

a. The mean
b. The median
c. The mode
d. All these choices are true.

Q.2 Which measure(s) of central location is/are meaningful when the data are ordinal?
a. The mean and median

b. The mean and mode

c. The median and mode

d. Only the mean

Q.3 Which of the following statements about the mean is not always correct?
a. The sum of the deviations from the mean is zero.

b. Half of the observations are on either side of the mean.

c. The mean is a measure of the central location.

d. The value of the mean times the number of observations equals the sum of all observations.
Q.4 Which of the following statements is true for the following observations: 9, 8, 7, 9, 6, 11 and 13?
a. The mean, median, and mode are all equal.

b. Only the mean and median are equal.

c. Only the mean and mode are equal.
d. Only the median and mode are equal.

Q.5 Which of these measures of central location is not sensitive to extreme values?

a. The mean

b. The median
c. The mode

d. All these choices are true

Q.6 In a positively skewed distribution:

a. the median equals the mean.

b. the median is less than the mean.

c. the median is larger than the mean.

d. the mean can be larger or smaller than the median.

Q.7 Which of the following statements about the median is not true?

a. It is more affected by extreme values than the mean.

b. It is a measure of central location.
c. It is equal to Q2.

d. It is equal to the mode in a bell-shaped distribution.

Q.8 Which of the following summary measures is sensitive to extreme values?

a. The median
b. The interquartile range

c. The mean
d. The first quartile


Q.9 In a perfectly symmetric bell-shaped "normal" distribution:

a. the mean equals the median.

b. the median equals the mode.

c. the mean equals the mode.

d. All these choices are true.

Q.10 Which of the following statements is true?

a. When the distribution is positively skewed, mean < median < mode.

b. When the distribution is negatively skewed, mean > median > mode.

c. When the distribution is symmetric and unimodal, mean = median = mode.

d. When the distribution is symmetric and bimodal, mean = median = mode.

3.9 CHECK YOUR PROGRESS - POSSIBLE ANSWERS

Q1 c

3.10 ANSWERS TO SELF-ASSESSMENT QUESTIONS

Q1 Reference

Q.2 Reference

Q3 False

Q.4 True

Q5 True

Q.6 False

Multiple Choice Questions

Q.1 d
Q.2
Q.3
Q.4
Q.5
Q.6
Q.7
Q.8
Q.9
Q.10

Case Study:
Ages of Senior Citizens

A sociologist recently conducted a survey of citizens over 65 years of age whose net worth is too high to qualify for Medicaid, and who have no private health insurance. The ages of 22 uninsured senior citizens were as follows: 65, 66, 67, 68, 69, 70, 71, 73, 74, 75, 76, 77, 78, 79, 80, 81, 86, 87, 91, 92, 94, and 97.
Q.1 Calculate the mean age of the uninsured senior citizens.

Ans. $\bar{x}$ = 78.0 years
Q.2 Calculate the median age of the uninsured senior citizens.

Ans. 76.5 years

Q.3 Explain why there is no mode for this data set.

Ans. There is no mode because every age is different.

BASIC PROBABILITY

STRUCTURE
4.0 Objectives
4.1 Introduction
4.2 Basic Probability Concepts
4.3 Conditional Probability
4.4 Ethical Issues and Probability
4.5 Bayes' Theorem
4.6 Let Us Sum Up
4.7 Key Words
4.8 References and Suggested Additional Readings
4.9 Self-Assessment Questions
4.10 Answers to Self-Assessment Questions
4.11 Check Your Progress - Possible Answers

4.0 OBJECTIVES
After reading this unit, you will be able to:

• understand the basic probability concepts

• understand conditional probability

• understand Bayes' theorem

4.1 INTRODUCTION

In Units 1, 2 and 3, we learned various ways of defining, organizing, visualizing and analyzing data to provide a sensible picture for business decision making. In this unit, we will look at business situations in which managers talk about the uncertainty or likelihood of sales of a particular product or service in their geographical locations, expressed either as a percentage or as a probability (out of 1), to assess the market potential. For example, what are the chances of the sale of FMCG goods in the U.P. market? What are the chances for a student to clear the exam? What is the chance of getting a head in a toss of a coin? Expressed as a fraction of 1, such a chance is called a probability. What is the chance of zero internet drops per day? The principles of probability help bridge the worlds of descriptive and inferential statistics. Probability means the likelihood that an event will occur.

4.2 BASIC PROBABILITY CONCEPTS

Probability is a numerical measure of the likelihood that an event occurs. When we say that the chance of selling a car in the Delhi market is 30%, we are conveying that, in Delhi, 30 people out of 100 will purchase the car. The same thing can be expressed as: the likelihood of a car sale in Delhi is 0.3, or the probability of a car selling in the Delhi market is 0.3, i.e. 30/100.

Thus, the concept of probability is used to measure the degree of uncertainty in a business. Probability values are always assigned on a scale of 0 to 1. A probability close to 0 indicates that the event is unlikely to happen. A probability close to 1 indicates a near-certainty that the event will occur.
Probability is represented by the symbol P. For example, the probability of passing the exam can be written as:
E = Passing the exam.

P(E) = .65
There are three types of probabilities:
• A priori

• Empirical
• Subjective

4.2.1 A PRIORI PROBABILITY

This is a type of probability in which prior information or knowledge about the process is available. The business manager uses this prior knowledge to measure the probability.

For example, suppose I ask you to measure the probability that a randomly chosen day of the year 2021 falls in January. Here you have the prior knowledge that January has 31 days and the year 2021 has 365 days.

$$P(\text{January}) = \frac{31}{365}$$

Using a simple calculation, we find that out of the 365 days of this year, 31 are in January. Therefore, the share of January in the year is 31/365, or, in other words, the probability of a January day in 2021 is 0.0849. A probability P is always between 0 and 1. In this example, we used prior knowledge to measure the probability; therefore, this type of probability is called a priori probability.

4.2.2 EMPIRICAL PROBABILITY

The probability that we calculate from the available data set of the business is called empirical probability. The formula for measuring probability is:
PROBABILITY OF OCCURRENCE

It is represented by the symbol "P".

$$P = \frac{n}{T}$$

where n = the number of ways in which the event occurs

T = the total number of possible outcomes.

In empirical probability, the probabilities are measured from the observed data, not from prior knowledge of the process. For example, if 70% of the students in a class are observed to take an interest in the statistics class, then we can say that 0.70 is the probability that an individual student has an interest in the statistics class.

4.2.3 SUBJECTIVE PROBABILITY


Subjective probability is an approach different from the above two. In this approach, the probability of an outcome is based on a combination of an individual's experience, observations, trends, and analysis of the situation. This approach is helpful when a priori and empirical options are not possible. It relies on a guess based on experience. For example, a company has launched a new product, say a football, and it is trying to predict the chances of its sale in the West Bengal market. It does not have prior knowledge because it is a new product. It has to use a subjective judgment based on the experience of other similar products or the competition. This is an example of subjective probability.
Now let us try to understand a few important terms used in inferential statistics.
Event:

An event is a set of sample points. For example, the days of January in a year are an event.
Probability of an event

The probability of any event is equal to the sum of the probabilities of the sample points in the event.
Some basic relationships of Probability

The complement of the event "Days of January" is defined to be the event consisting of all the sample points that are not in January. The complement of an event E is denoted by $E^c$.
We can understand the complement of an event with the help of Figure 4.1.

Fig. 4.1: Venn diagram showing the sample space, event A, and the complement of event A

In any probability application, either event E or its complement $E^c$ must occur.

$$P(E) + P(E^c) = 1$$
or, $P(E) = 1 - P(E^c)$
Consider a company where 80% of the new customer contacts result in no sale. If the event of a sale is represented by S, then $S^c$ is the complement of the event S, i.e. no sale.

In this example $P(S^c)$ = .80.

Now, using the above equation,
$P(S)$ = 1 - .80 = .20
We say that a new customer contact has a .20 probability of generating a sale.

Addition Law:
Union of two events
The union of two events A and B is the set of all the sample points that belong to A or B or both.

We illustrate this with the help of a diagram.

Fig. 4.2: Venn diagram of events A and B. The area shaded black and grey is A ∪ B.
Let us try to understand with the help of an example.

What is the probability that a day of the year 2021 is a January day or a Wednesday?

If P(Jan) and P(Wed) are the probabilities of the January event and the Wednesday event, then P(Jan ∪ Wed) is the probability that a day is either in January or a Wednesday.

That is, the probability of January + the probability of Wednesday - the probability of Wednesdays in January (to avoid double-counting, because the 52 Wednesdays of the year also include the January Wednesdays).
Thus, it is P(Jan) + P(Wed) - P(Jan ∩ Wed),

i.e. 31 ÷ 365 + 52 ÷ 365 - 4 ÷ 365 = 79 ÷ 365

= 0.22 (rounded)

This is expressed through the general equation:

P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
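
The addition law can be checked by listing the days of a year and counting them directly. The sketch below does this with Python's standard `datetime` module. The year 2021 is chosen here as an assumption, because it is a 365-day year with 31 January days, 52 Wednesdays and 4 Wednesdays in January, which are the counts used in the calculation above.

```python
from datetime import date, timedelta

YEAR = 2021  # assumed year: 365 days, 52 Wednesdays, 4 of them in January


def days_of_year(year):
    """Yield every calendar date of the given year."""
    current = date(year, 1, 1)
    while current.year == year:
        yield current
        current += timedelta(days=1)


days = list(days_of_year(YEAR))
january = {d for d in days if d.month == 1}
wednesdays = {d for d in days if d.weekday() == 2}  # Monday is 0, so Wednesday is 2

union = january | wednesdays   # January day OR Wednesday
both = january & wednesdays    # January day AND Wednesday

# Direct count of the union versus the addition law.
p_union = len(union) / len(days)
p_addition_law = (len(january) + len(wednesdays) - len(both)) / len(days)

print(f"P(Jan or Wed) by counting      = {p_union:.4f}")
print(f"P(Jan) + P(Wed) - P(Jan and Wed) = {p_addition_law:.4f}")
```

Both routes give the same value because subtracting the intersection removes exactly the days that were counted twice.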

The intersection of two events:

The sample points that are common to both events form the intersection of A and B, represented by the symbol A ∩ B. It is the grey-shaded portion in the left diagram (also known as a Venn diagram).

Figure: Venn diagrams of the intersection of two sets (A ∩ B, left) and the union of two sets (A ∪ B, right).
If the two events are independent, the probability of the intersection, i.e. that both events occur, is the product of the probabilities of the two events, i.e. P(A) × P(B).

Mutually exclusive events are events with no common sample points.

Let us try to understand with this example:

Among the days of January and February in the year 2021, there is no day common to January and February.
The events Jan and Feb are therefore mutually exclusive.

So, P(Jan ∪ Feb) = P(Jan) + P(Feb)

= 31 ÷ 365 + 28 ÷ 365
= (31 + 28) ÷ 365
= 59 ÷ 365
= 0.1616

4.3 CONDITIONAL PROBABILITY

The probability of A given B:

$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}$$

The probability of B given A:

$$P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}$$

In the above formulae, the left-hand side is the conditional probability; we read the first as the probability of A given that B has occurred.

Let us understand this with the help of a problem:

Suppose 2,000 students in Mumbai were visited to find out how many plan to seek admission to PGDM through DLP. Of these, 300 are planning to seek admission to PGDM through DLP. After two months, we find that 200 of them have obtained admission. From these figures we can find the conditional probability of a student taking admission given that they planned to do so.

P(admission | planned admission)

= (Planned to take admission and actually took admission) ÷ (Planned to take admission)

= 200 ÷ 300
= 0.67
0.67 is the conditional probability.

Let us try to understand with the help of the following case study.
Consider the promotion status of male and female officers in the Education Department in
UP. The Education Department consists of 1,000 officers, 800 men and 200 women. Over the
past two years, 300 education officers obtained promotion; the breakdown for male and
female officers is shown in Table 4.1.

Table 4.1: Promotion Status

Men Women Total

Promoted 200 100 300

Not Promoted 600 100 700

Total 800 200 1000


After reviewing the promotion record, female officers raised a discrimination issue: 200 male officers had received promotions, but only 100 female officers had received promotions.
The Education Department explained that the relatively low number of promotions for female officers was not due to discrimination, but to the fact that relatively few females are members of the Education Department.
Let us see how conditional probability could be used to analyze the discrimination charge.

Let
EM = event an officer is a man
EW = event an officer is a woman
A = event an officer is promoted

$A^c$ = event an officer is not promoted

Dividing the data values in Table 4.1 by the total of 1,000 officers gives the joint probabilities. For example,

P(EM ∩ A) = 200 ÷ 1,000 = .20 = probability that a randomly selected officer is a man and is promoted.

Joint probability table (the values in the Total row and Total column are the marginal probabilities):

                Men     Women   Total

Promoted        .20     .10     .30

Not Promoted    .60     .10     .70

Total           .80     .20     1.00

Let us calculate the probability that an officer is promoted given that the officer is a man, in other words, the probability of promotion within the men category.

We apply the conditional probability formula:

$$P(A \mid EM) = \frac{P(EM \cap A)}{P(EM)} = \frac{.20}{.80} = .25$$
Similarly, the probability of promotion within the women category is:

$$P(A \mid EW) = \frac{P(EW \cap A)}{P(EW)} = \frac{.10}{.20} = .50$$
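
The joint, marginal and conditional probabilities of the promotion example can be reproduced with a short Python sketch. The counts are those of Table 4.1; the dictionary layout and the function names are assumptions made for illustration.

```python
# Counts from Table 4.1, keyed by (group, promotion status).
counts = {
    ("Men", "Promoted"): 200,
    ("Women", "Promoted"): 100,
    ("Men", "Not Promoted"): 600,
    ("Women", "Not Promoted"): 100,
}

total = sum(counts.values())

# Joint probabilities: each count divided by the total number of officers.
joint = {key: count / total for key, count in counts.items()}


def marginal(group):
    """Marginal probability that an officer belongs to the given group."""
    return sum(p for (g, _), p in joint.items() if g == group)


def conditional(status, group):
    """P(status | group) = P(group and status) / P(group)."""
    return joint[(group, status)] / marginal(group)


print(f"P(Promoted | Men)   = {conditional('Promoted', 'Men'):.2f}")
print(f"P(Promoted | Women) = {conditional('Promoted', 'Women'):.2f}")
```

Dividing by the marginal probability of the group is what removes the effect of there being four times as many men as women in the department.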
What conclusion do we draw?

The probability of promotion for a woman officer is double the probability of promotion for a man officer. The conditional probability calculation rejects the argument presented by the female officers.

4.4 ETHICAL ISSUES AND PROBABILITY

Ethical issues related to probability arise when information is expressed in terms of probability or chances, but the public gets confused and is not able to trust it easily. This type of situation generally arises when the probability information is used in advertisements.
To ensure that the probability information is ethically correct, we must cite the pertinent data set or the source of knowledge and, in the process, analyze the facts.

Clarification of a probability-related statement with supporting data is important to ensure transparency.

4.5 BAYES' THEOREM


This theorem was developed by Thomas Bayes in the eighteenth century.

Bayes' theorem is used to obtain revised, or posterior, probabilities from existing
probabilities, i.e. it reverses the direction of conditioning.
We start the analysis with initial, or prior, probability estimates for specific events of interest,
and then obtain additional information about the events from sources such as a sample or a
special report. Using the new information, we update the prior probability values by
calculating revised probabilities, referred to as posterior probabilities. Bayes' theorem
provides the method for this calculation.
The flow of the probability revision process is shown in Fig. 4.3.

Fig. 4.3: The flow of probability revision as per Bayes' theorem

Prior Probabilities -> New Information -> Application of Bayes' Theorem -> Posterior Probabilities

Prior Probabilities: Initial probability estimates for specific events of interest.

Posterior Probabilities: Revised probability estimates based on the derived information.


Statement: If $X_1, X_2, \ldots, X_n$ are mutually exclusive events with $P(X_i) > 0$ $(i = 1, 2, \ldots, n)$, then for any event $Y$ which is a subset of $X_1 \cup X_2 \cup \cdots \cup X_n$, in the two-event case:

$$P(X_1 \mid Y) = \frac{P(X_1)\,P(Y \mid X_1)}{P(X_1)\,P(Y \mid X_1) + P(X_2)\,P(Y \mid X_2)}$$

$$P(X_2 \mid Y) = \frac{P(X_2)\,P(Y \mid X_2)}{P(X_1)\,P(Y \mid X_1) + P(X_2)\,P(Y \mid X_2)}$$

Please note that we have derived the probabilities of X1 and X2 given that Y has occurred from
the probabilities of Y given that X1 and X2, respectively, have occurred.

Let us consider a case study.


Bayes' Theorem: Two Events Cases (Example)

A manufacturing company receives raw materials from two vendors, Ram and Mohan. 70 %
are ordered from Ram and 30% from Mohan. Further, 2% of parts coming from Ram are
defective, and 5% from Mohan are defective.

The machine breaks down because it processed a bad part. What is the probability that the
part came from Ram?

Prior probabilities:
P(A1) = .70 and P(A2) = .30
i.e. what are the posterior probabilities P(A1|B) and P(A2|B)?

We use the Tree Approach and the Tabular Approach to solve the problem.
In the above case, the information we have is the historical quality level of the two suppliers.

Vendor     | Percentage of good parts | Percentage of bad parts
Ram (A1)   | 98                       | 2
Mohan (A2) | 95                       | 5

Let G denote a good part and B a bad part. Then the conditional probabilities for each supplier are

P(G|A1) = .98 and P(B|A1) = .02

P(G|A2) = .95 and P(B|A2) = .05

We see that four experimental outcomes are possible, of which two correspond to the part
being good and two correspond to the part being bad.

Vendor (X) | P(X) | P(D|X) | P(X) × P(D|X)
Ram        | .70  | .02    | .014
Mohan      | .30  | .05    | .015
Total      |      |        | .029

Here D denotes a defective (bad) part. The probability that the defective part is from Ram is

P(Ram|D) = .014 / .029 = .48

Therefore, the probability that the defective part came from Ram is .48, i.e.
48% (refer to the earlier equation).
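The two-event calculation can be sketched in Python as follows. The function name `posterior` and the dictionary layout are illustrative assumptions; the priors and defect rates are the ones stated in the example.

```python
def posterior(priors, likelihoods):
    """Posterior P(source | evidence) for each source, via Bayes' theorem."""
    # Joint probability of each source together with the evidence.
    joint = {k: priors[k] * likelihoods[k] for k in priors}
    # P(evidence): the joint probabilities of mutually exclusive sources add up.
    total = sum(joint.values())
    return {k: v / total for k, v in joint.items()}


priors = {"Ram": 0.70, "Mohan": 0.30}        # share of parts ordered
defect_rates = {"Ram": 0.02, "Mohan": 0.05}  # P(defective | vendor)

post = posterior(priors, defect_rates)
for vendor, p in post.items():
    print(vendor, round(p, 2))
```

Although Ram supplies most of the parts, Mohan's higher defect rate makes him slightly more likely to be the source of a defective part.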

Tabular Approach:
Let us consider a case study.

In a bolt factory, machines A, B, and C manufacture, respectively, 30%, 36% and 34% of the
total production. Of their output, 10%, 5%, and 8%, respectively, are defective bolts.
One bolt is drawn at random from the lot and is found to be defective. What is the probability
that it was manufactured by machine C?
Solution:

Let D denote the event that the bolt is defective and Ai the event that it was made by machine i.

Machine | Prior Probability P(Ai) | Conditional Probability P(D|Ai) | Joint Probability P(Ai) × P(D|Ai) | Posterior Probability P(Ai|D)
A       | .30                     | .10                             | .0300                             | .0300/.0752 = .3989
B       | .36                     | .05                             | .0180                             | .0180/.0752 = .2394
C       | .34                     | .08                             | .0272                             | .0272/.0752 = .3617
Total   | 1.00                    |                                 | .0752                             | 1.0000

The probability that a randomly picked defective bolt came from machine C is .3617 (36.17%).
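The tabular approach extends to any number of sources. The sketch below rebuilds the table column by column for the three machines; the variable names are illustrative, and the shares and defect rates are those given in the case study.

```python
# (machine, share of production, defect rate) as given in the case study
machines = [("A", 0.30, 0.10), ("B", 0.36, 0.05), ("C", 0.34, 0.08)]

# Joint probability of "made on this machine AND defective"
joint = {name: prior * defect for name, prior, defect in machines}

# Overall probability that a randomly drawn bolt is defective
total_defective = sum(joint.values())

# Posterior probability of each machine, given that the bolt is defective
posterior = {name: j / total_defective for name, j in joint.items()}

for name, prior, defect in machines:
    print(f"{name}  {prior:.2f}  {defect:.2f}  {joint[name]:.4f}  {posterior[name]:.4f}")
```

Each printed row corresponds to one row of the table: prior, conditional, joint, and posterior probability.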

4.6 LET US SUM UP
In this unit, we introduced the concept of probability and learned from examples how
probability analysis can be used for decision-making in business. We learned that a probability
is a numeric value between 0 and 1 that represents the likelihood that a
particular event will occur. We also learned about more complex aspects of probability, such as
joint, conditional, and posterior probabilities.

4.7 KEYWORDS

Probability: A numerical value of the chances that an event will occur.

Conditional Probability: The probability of an event given that another event has already
occurred. The conditional probability of A given B is P(A|B) = P(A ∩ B)/P(B).

Joint Probability: The probability of two events occurring simultaneously, that is the
intersection of two events.

Marginal Probability: The values in the margins of a joint probability table that provide the
probabilities of each event separately.

Bayes’ Theorem: A method used to compute posterior probabilities.

4.8 REFERENCES AND SUGGESTED ADDITIONAL READINGS


Business Statistics, by David M. Levine (Author), David F. Stephan (Author), Kathryn A. Szab

4.9 SELF-ASSESSMENT QUESTIONS


Q.1 Of the last 500 customers entering a supermarket, 50 have purchased a wireless
phone. If the relative frequency approach for assigning probabilities is used, the
probability that the next customer will purchase a wireless phone is:

a. 0.10
b. 0.90
c. 0.50
d. None of these choices.
Q.2 If A and B are mutually exclusive events with P(A) = 0.75, then P(B):

a. can be any value between 0 and 1
b. can be any value between 0 and 0.75
c. cannot be larger than 0.25
d. equals 0.25

Q.3 If you roll a balanced die 50 times, you expect an even number to appear:
a. on every other roll
b. exactly 50 times out of 100 rolls
c. 25 times on average, over the long-term
d. All of these choices are true

Q.4 An approach of assigning probabilities which assume that all outcomes of the
experiment are equally likely is referred to as the:

a. subjective approach
b. objective approach
c. classical approach
d. relative frequency approach

Q.5 The collection of all possible outcomes of an experiment is called:

a. a simple event
b. a sample space
c. a sample
d. a population

Q.6 If two events are mutually exclusive, what is the probability that one or the other
occurs?

a. 0.00

b. 0.50

c. 1.00
d. Cannot be determined from the information given

Q.7 If two events are mutually exclusive, what is the probability that both occur at the
same time?

a. 0.00
b. 0.50
c. 1.00
d. Cannot be determined from the information given

Q.8 If two events are mutually exclusive and collectively exhaustive, what is the probability
that both occur?
a. 0.00
b. 0.50
c. 1.00
d. Cannot be determined from the information given

Q.9 If two events are mutually exclusive and collectively exhaustive, what is the
probability that one or the other occurs?

a. 0.00
b. 0.50
c. 1.00

d. Cannot be determined from the information given


Q.10 If events A and B are mutually exclusive and collectively exhaustive, what is the
probability that event A occurs?
a. 0.25
b. 0.50
c. 1.00
d. Cannot be determined from the information given

CHECK YOUR PROGRESS


Q.1 A survey of magazine subscribers showed that 45.8% rented a car during the past 12
months for business reasons, 54% rented a car during the past 12 months for personal
reasons, and 30% rented a car during the past 12 months for both business and
personal reasons.

a. What is the probability that a subscriber rented a car during the past 12 months
for business or personal reasons?

b. What is the probability that a subscriber did not rent a car during the past 12
months for either business or personal reasons?

Q.2 Assume that we have two events, A and B, that are mutually exclusive. Assume further
that we know P(A) = .30 and P(B) = .40.

a. What is P(A ∪ B)?
b. What is P(A ∩ B)?

c. What general conclusion would you make about mutually exclusive and
independent events given the results of this problem?

4.10 ANSWERS TO SELF-ASSESSMENT QUESTIONS

Q1 a

Q.2 c

Q3
Q4
a

Q5
7

Q.6
Qa

Q.7

08
Q9

Q.10

4.11 CHECK YOUR PROGRESS - POSSIBLE ANSWERS

Q1
Q.2

DISCRETE PROBABILITY DISTRIBUTION

STRUCTURE
5.0 Objectives
5.1 Introduction
5.2 Definitions
5.3 Probability Distributions
5.4 The Importance of Expected Value in Decision-Making
5.5 Binomial Probability Distribution
5.6 Poisson Distribution
5.7 Let Us Sum Up
5.8 Key Words
5.9 Case
5.10 Self-Assessment Questions

5.0 OBJECTIVES
After reading this unit, you will be able to:

• understand the properties of a probability distribution
• explain the difference between discrete probability distribution and continuous
probability distribution
• compute the expected value and variance of probability distributions
• compute probabilities of Binomial and Poisson distributions

5.1 INTRODUCTION

In this unit, you will be familiarized with the concepts and characteristics of probability
distributions. We shall learn about two special cases of probability distributions, the Binomial
distribution and the Poisson distribution, along with their assumptions and applications, using problems.

5.2 DEFINITIONS

Let us introduce some basic terms that we will be using in this unit.

We have already studied different types of variables in Unit 1. Let us recapitulate the concept
of a variable. A variable is a characteristic, number, or quantity that we are interested in
exploring, e.g. age, country of birth, marital status, etc.

In this unit, we will be using the term random variable. What is the difference between a
variable and a random variable? In statistics, randomness has a pattern. Consider rolling an
unbiased die: we do not know the outcome of the experiment until the die is rolled, but we do
know that the outcome will be one of the numbers from 1 to 6. The outcome of rolling the die is
therefore a random variable, i.e. when the value of a variable is the outcome of a statistical
experiment, that variable is known as a random variable.
For more clarity, let us now define a random variable:

A random variable is one whose range of outcomes is known in advance, but whose
actual outcome is known only after the experiment is performed. A random variable is represented by
the letter x (or X), and it assigns a numerical value to every outcome in the sample space. Just as
variables are divided into two types, continuous and discrete, random variables are
categorized as discrete random variables and continuous random variables.
Difference between Discrete and Continuous Random Variable

When our interest is in counting the outcomes of the experiment, we use a discrete
random variable, e.g. the number of customers expected at a petrol pump, the number of
children expected in a family, the number of patients in an OPD, etc.

When our interest is in measuring the outcome of the experiment, which is not limited
to discrete or integer values but can assume any value within a range, we use a continuous
random variable, e.g. the time taken to reach the airport, the percentage of impurity in a batch of
chemicals, the annual income of a player, etc.

Exercises based on Section 5.2

5.1 Explain the term random variable.

5.2 Explain different types of random variables with suitable examples.

5.3 Categorize the following random variables as discrete or continuous.

a) x = interest rate of fixed deposits at a bank
b) x = no. of spelling mistakes on one page
c) x = time taken to complete an assignment
d) x = no. of defective bulbs in a lot
e) x = no. of girls in a class
f) x = temperature measured on different days in a month
g) x = no. of computers sold in a day
h) x = total daily sales of Amazon in India

5.3 PROBABILITY DISTRIBUTIONS

A probability distribution is an extension of the idea of probability. A probability is the chance of
occurrence of a single outcome of the experiment, whereas a probability distribution is the
set of probabilities associated with all the outcomes of the experiment. Different data will
have different types of distributions.

Definition of a probability distribution


A probability distribution is a mathematical function that provides the probabilities of
occurrence for all different possible outcomes of an experiment.
Types of Probability distributions
Probability distributions are of two types, discrete probability distributions and continuous
probability distributions, depending on whether they define probabilities for discrete or
continuous random variables.

Discrete Probability Distributions


A discrete distribution describes the probabilities of occurrence of each value of a discrete
random variable. A discrete random variable can be described either by its probability mass
function (PMF, labelled PDF in this unit) or by its cumulative distribution function (CDF). The PDF
gives the probability of each value x of X; for example, the probability of exactly x successes in n
independent trials. The CDF gives the cumulative sum of probabilities from the smallest
value of X up to x; for example, the probability of at most x successes in n independent trials.
Fig. 5.1 and Fig. 5.2 depict the PDF and CDF, respectively, for a discrete distribution. The graph
of a CDF is always non-decreasing, because each probability gets added to the running total,
and it finally reaches 1.
Fig. 5.1: Graph of PDF (x-axis: values of X; y-axis: probability)

Fig. 5.2: Graph of CDF (x-axis: values of X; y-axis: probability)
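To illustrate how a CDF accumulates the individual probabilities, here is a minimal Python sketch. It assumes a fair six-sided die as the example (an illustrative choice, not the distribution plotted in the figures) and uses exact fractions so the final value is exactly 1.

```python
from fractions import Fraction
from itertools import accumulate

# Probability of each face of a fair six-sided die.
faces = range(1, 7)
pmf = {x: Fraction(1, 6) for x in faces}

# CDF: running total of the probabilities, F(x) = P(X <= x).
cdf = dict(zip(faces, accumulate(pmf[x] for x in faces)))

for x in faces:
    print(x, pmf[x], cdf[x])
```

The last column never decreases and ends at 1, which is the behaviour described for the CDF graph.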

Continuous Probability Distributions


A continuous distribution describes the probabilities associated with a
continuous random variable. Probabilities for a continuous random variable X are
defined as areas under the curve of its probability density function (PDF). Thus, we can
calculate probabilities only for values lying within a range. The probability that a continuous
random variable equals any single value is always zero.
Table 5.1: Some examples of probability distributions

Sr. no | Decision problem | Discrete random variable range
1 | No. of absentees in a class of 60 | X = {0, 1, 2, ..., 60}
2 | ABC is a supplier of cloth to Arvind Mills. The manager at Arvind Mills will reject the lot if it has more than 10 defects per 20 meters. | X = {0, 1, 2, ...}
3 | The number of claims received by LIC in a particular year. | X = {0, 1, 2, ..., n}, where n is the no. of policy holders
4 | In a game of gambling, you win Rs 10 if the outcome of the die is greater than 4, and you lose Rs 5 if the outcome of the die is less than 3. | X = {-5, 0, 10}
5 | The number of wrong calls you receive in an hour. | X = {0, 1, 2, ...}
6 | No. of heads when three unbiased coins are tossed simultaneously. | X = {0, 1, 2, 3}

To make this clearer, let us expand example 6 and work out the probability distribution.

Three unbiased coins are tossed simultaneously, so the sample space is S = {TTT, HTT, THT,
TTH, HHT, HTH, THH, HHH}.
Let X represent the no. of heads; the probability distribution is shown in Table 5.2.

Table 5.2: Probability distribution of no. of heads

Outcomes      | X | P(X)
TTT           | 0 | 0.125
HTT, THT, TTH | 1 | 0.375
HHT, HTH, THH | 2 | 0.375
HHH           | 3 | 0.125

[Bar chart: tossing of 3 coins simultaneously; x-axis: no. of heads; y-axis: probability]
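The distribution of the number of heads can also be obtained by listing the sample space programmatically. The sketch below enumerates all eight equally likely outcomes and counts the heads in each; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# All 2 x 2 x 2 = 8 equally likely outcomes of tossing three unbiased coins.
outcomes = list(product("HT", repeat=3))

# Count how many outcomes give 0, 1, 2 or 3 heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability = favourable outcomes / total outcomes.
dist = {x: counts[x] / len(outcomes) for x in sorted(counts)}
print(dist)
```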

Before we discuss further, let us look at the properties of discrete probability
distributions. The probability of each outcome lies between 0 and 1, inclusive. All
the outcomes of the experiment are mutually exclusive, and since the list of outcomes is
exhaustive, the probabilities of all the outcomes sum to 1.
Properties of Discrete Probability Distribution P(x)

$0 \le P(x_i) \le 1$, for each value of $x_i$

$\sum_{i=1}^{n} P(x_i) = 1$


Table 5.2 looks similar to the frequency distributions studied in Unit 2.
A frequency distribution is used for summarizing the values of observed data. It is
prepared by listing all possible outcomes when an experiment is conducted
and counting the frequency of each outcome. Measures of descriptive statistics like the mean
and standard deviation are used to summarize frequency distributions. Similarly, we can
think of a probability distribution as a theoretical frequency distribution, i.e. without
performing the experiment we assign a probability to each possible outcome and measure the
center and the spread of the random variable x. The center is measured by the expected value
(mean); the spread is measured by the variance or standard deviation. The formulas for the
mean, variance, and standard deviation are given below:

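Expected value (mean): $\mu = E(X) = \sum_{i=1}^{n} x_i \, P(x_i)$

Variance:

$$V(X) = \sum_{i=1}^{n} (x_i - \mu)^2 \, P(x_i)$$

$$V(X) = \sum_{i=1}^{n} x_i^2 \, P(x_i) - 2\mu \sum_{i=1}^{n} x_i \, P(x_i) + \mu^2 \sum_{i=1}^{n} P(x_i)$$

$$V(X) = \sum_{i=1}^{n} x_i^2 \, P(x_i) - 2\mu \cdot \mu + \mu^2$$

$$V(X) = \sum_{i=1}^{n} x_i^2 \, P(x_i) - \mu^2$$

Standard deviation: $SD(X) = \sqrt{V(X)}$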
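These formulas translate directly into code. The following sketch, with the hypothetical helper name `summarize`, computes the mean, variance, and standard deviation of any discrete distribution given as a value-to-probability mapping, and applies it to the number of heads in three coin tosses.

```python
from math import sqrt


def summarize(dist):
    """Return (mean, variance, standard deviation) of a discrete distribution.

    `dist` maps each value x to its probability P(x).
    """
    mean = sum(x * p for x, p in dist.items())
    second_moment = sum(x ** 2 * p for x, p in dist.items())
    variance = second_moment - mean ** 2  # shortcut form: E(X^2) - mu^2
    return mean, variance, sqrt(variance)


# Number of heads when three unbiased coins are tossed.
heads = {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
mean, variance, sd = summarize(heads)
print(mean, variance, round(sd, 4))
```

The function uses the shortcut form of the variance, which needs only one pass over the squared values once the mean is known.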


As these distributions make use of probability, they are useful in making inferences under
conditions of uncertainty.
Let us clarify the difference between a frequency distribution and a probability distribution by
taking the example of tossing a coin.

A random experiment is conducted in which we toss an unbiased coin 100 times and observe
a tail 49 times and a head 51 times. This is represented as

Table 5.3: Frequency distribution of tossing a coin

X   | Freq
H   | 51
T   | 49
n = | 100

A frequency distribution is a listing of the values that occurred when the experiment was
conducted, so it provides information about the variable based on actual
data. A probability distribution is a listing of the probabilities of all possible outcomes
when the experiment is conducted in the future, so it helps us in
finding the possible behaviours of a variable.
Table 5.4: Probability distribution of tossing a coin

X | P(x)
H | 0.5
T | 0.5

The probabilities are equal because it is assumed that a head and a tail are equally likely when a coin is flipped.

Example 5.1: Suppose that Mr. Sharma wants to start a business selling ice cream. Since he
has limited funds, he can initially make 200 cups, so he decides to make 100 cups each of two
flavours, vanilla and butterscotch. After a week he wants to know whether the demand for
the two flavours is similar or whether one flavour is more in demand. Is it
beneficial to switch to only one flavour? The data he collected are shown in Table 5.5
below:
Table 5.5: Demand for Flavours of Ice Cream

Day | Vanilla (x) | Butterscotch (y)
1   | 70          | 60
2   | 80          | 70
3   | 70          | 70
4   | 100         | 80
5   | 70          | 50
6   | 100         | 70
7   | 90          | 60

Solution:

Table 5.6: Probability Distribution for Vanilla Flavour

X     | Frequency | P(x)     | x × P(x) | x² × P(x)
70    | 3         | 0.428571 | 30.00    | 2100.00
80    | 1         | 0.142857 | 11.43    | 914.29
90    | 1         | 0.142857 | 12.86    | 1157.14
100   | 2         | 0.285714 | 28.57    | 2857.14
Total | 7         | 1        | 82.86    | 7028.57

Each probability is the frequency divided by the 7 days observed.
Mean = E(x) = 82.86, i.e. the average demand for vanilla flavour ice cream was 82.86 cups.
Variance = V(x) = E(x²) − (E(x))² = 163.27

Standard deviation = SD(x) = square root of variance = 12.78


Table 5.7: Probability Distribution for Butterscotch Flavour

Y     | Frequency | P(y)     | y × P(y) | y² × P(y)
50    | 1         | 0.142857 | 7.14     | 357.14
60    | 2         | 0.285714 | 17.14    | 1028.57
70    | 3         | 0.428571 | 30.00    | 2100.00
80    | 1         | 0.142857 | 11.43    | 914.29
Total | 7         | 1        | 65.71    | 4400.00

Mean = E(y) = 65.71, i.e. the average demand for butterscotch flavour ice cream was 65.71
cups.
Variance = V(y) = E(y²) − (E(y))² = 81.63

Standard deviation = SD(y) = 9.04
The demands for the two flavours of ice cream are not similar; vanilla is more in
demand. However, it is not beneficial to switch to vanilla alone, as on average the
demand for butterscotch flavour is also high.
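As a cross-check, the summary measures can be computed directly from the daily figures in Table 5.5. The sketch below uses Python's standard `statistics` module; it relies on the fact that the population variance of the raw data equals the variance of the probability distribution built from relative frequencies. The list names are illustrative.

```python
from statistics import mean, pstdev, pvariance

# Daily demand (cups) for the seven days recorded in Table 5.5.
vanilla = [70, 80, 70, 100, 70, 100, 90]
butterscotch = [60, 70, 70, 80, 50, 70, 60]

for name, demand in (("vanilla", vanilla), ("butterscotch", butterscotch)):
    # pvariance and pstdev divide by n, matching a distribution whose
    # probabilities are the relative frequencies of the observed values.
    print(name, round(mean(demand), 2), round(pvariance(demand), 2), round(pstdev(demand), 2))
```

Working from the raw data avoids transcription slips when frequencies are tallied by hand.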
Properties of Mean of Random Variables

a) If X and Y are random variables, then E(X + Y) = E(X) + E(Y)

b) If X1, X2, ..., Xn are random variables, then E(X1 + X2 + ... + Xn) = E(X1) + E(X2) + ... + E(Xn) = Σ E(Xi)

c) For random variables X and Y, if X and Y are independent, E(XY) = E(X) E(Y)

d) If a is any constant and X is a random variable, E(aX) = a × E(X) and E(X + a) = E(X) + a

e) For any random variable X, if X ≥ 0, then E(X) ≥ 0

f) E(Y) ≥ E(X), if the random variables X and Y are such that Y ≥ X

Properties of Variance of Random Variables

a) The variance of any constant is zero, i.e. V(a) = 0, where a is any constant

b) If X is a random variable, and a and b are any constants, then V(aX + b) = a² V(X)

c) If X and Y are independent random variables, V(X + Y) = V(X) + V(Y)

d) If X and Y are independent random variables, V(X − Y) = V(X) + V(Y)
e) For any pairwise independent random variables X1, X2, ..., Xn and for any constants a1,
a2, ..., an: V(a1 X1 + a2 X2 + ... + an Xn) = a1² V(X1) + a2² V(X2) + ... + an² V(Xn).

Example 5.2: The marginal probability distributions of the rate of return for two stocks are
shown below:

AXIS (x) | 0   | 1   | 2
p(x)     | 0.5 | 0.3 | 0.2

HDFC (y) | 0   | 1   | 2
p(y)     | 0.4 | 0.5 | 0.1

a) Compute the mean and variance of AXIS Bank (X).
b) Compute the mean and variance of HDFC Bank (Y).
c) Note that X and Y are independent and find their bivariate distribution.
d) Determine the probability distribution of the random variable X + Y.
e) Calculate E(X + Y) directly by using the probability distribution of X + Y.
f) Calculate V(X + Y) directly by using the probability distribution of X + Y.
g) Verify that V(X + Y) = V(X) + V(Y). Did you expect this result? Why?
h) Find the probability distribution of the random variable XY.
i) Calculate E(XY) directly by using the probability distribution of XY.
j) Verify that E(XY) = E(X)E(Y). Did you expect this result? Why?

Solution

a) $E(X) = \mu_X = \sum_{i=1}^{3} x_i \, P(x_i)$

E(X) = 0 × 0.5 + 1 × 0.3 + 2 × 0.2 = 0.70

$V(X) = \sigma_X^2 = \sum_{i=1}^{3} x_i^2 \, P(x_i) - (E(X))^2$

V(X) = (0² × 0.5 + 1² × 0.3 + 2² × 0.2) − (0.7)² = 0.61

b) $E(Y) = \mu_Y = \sum_{i=1}^{3} y_i \, P(y_i)$

E(Y) = 0 × 0.4 + 1 × 0.5 + 2 × 0.1 = 0.70

$V(Y) = \sigma_Y^2 = \sum_{i=1}^{3} y_i^2 \, P(y_i) - (E(Y))^2$

V(Y) = (0² × 0.4 + 1² × 0.5 + 2² × 0.1) − (0.7)² = 0.41
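c) Since X and Y are independent, each joint probability is the product of the corresponding
marginal probabilities. The table below is known as the joint probability distribution table.
Table 5.8: Joint Probability Distribution

      | X = 0 | X = 1 | X = 2
Y = 0 | .20   | .12   | .08
Y = 1 | .25   | .15   | .10
Y = 2 | .05   | .03   | .02

Let us now see how the joint probability distribution table is obtained from the marginal tables.
1st row: (0, 0), i.e. X = 0 and Y = 0; (1, 0), i.e. X = 1 and Y = 0; (2, 0), i.e. X = 2 and Y = 0.

When x = 0 and y = 0, probability = 0.5 × 0.4 = 0.20; when x = 1 and y = 0, probability = 0.3 × 0.4 = 0.12;
when x = 2 and y = 0, probability = 0.2 × 0.4 = 0.08; and so on.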

d) The probability distribution of the random variable X + Y

Table 5.7: Probability of Random Variable X + Y

x + y    | 0    | 1    | 2    | 3    | 4
p(x + y) | 0.20 | 0.37 | 0.28 | 0.13 | 0.02

How to obtain the distribution of X + Y:
There is only one option for X + Y = 0, i.e. X = 0, Y = 0: probability = 0.5 × 0.4 = 0.20

There are two options for X + Y = 1, i.e. X = 0, Y = 1 and X = 1, Y = 0: probability = 0.5 × 0.5 + 0.3 ×
0.4 = 0.37

There are three options for X + Y = 2, i.e. X = 0, Y = 2 or X = 1, Y = 1 or X = 2, Y = 0: probability =
0.5 × 0.1 + 0.3 × 0.5 + 0.2 × 0.4 = 0.28, and so on.

e) E(X + Y) = 0 × 0.20 + 1 × 0.37 + 2 × 0.28 + 3 × 0.13 + 4 × 0.02

E(X + Y) = 1.40
f) V(X + Y) = 0² × 0.20 + 1² × 0.37 + 2² × 0.28 + 3² × 0.13 + 4² × 0.02 − (1.40)²

V(X + Y) = 1.02
g) V(X) + V(Y) = 0.61 + 0.41 = 1.02 = V(X + Y). Yes, since X and Y are independent random
variables.

h) The probability distribution of the random variable XY

xy    | 0    | 1    | 2    | 4
p(xy) | 0.70 | 0.15 | 0.13 | 0.02

XY = 0: there are five possibilities, (0,0), (0,1), (0,2), (1,0), (2,0):
probability = 0.5 × 0.4 + 0.5 × 0.5 + 0.5 × 0.1 + 0.3 × 0.4 + 0.2 × 0.4 = 0.70

XY = 1: there is one possibility, (1,1): probability = 0.3 × 0.5 = 0.15
XY = 2: there are two possibilities, (1,2) and (2,1): probability = 0.3 × 0.1 + 0.2 × 0.5 = 0.13
XY = 4: there is one possibility, (2,2): probability = 0.2 × 0.1 = 0.02

i) E(XY) = 0 × 0.70 + 1 × 0.15 + 2 × 0.13 + 4 × 0.02

E(XY) = 0.49

j) E(X)E(Y) = (0.7)(0.7) = 0.49 = E(XY). Yes, since X and Y are independent random
variables.
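The steps of this example can be sketched in Python as a check. The code builds the joint distribution from the two marginals under the stated independence assumption, then derives the distributions of X + Y and XY by adding up the joint probabilities that lead to each value. The helper names are illustrative.

```python
from collections import defaultdict
from itertools import product

px = {0: 0.5, 1: 0.3, 2: 0.2}  # marginal distribution of X (AXIS)
py = {0: 0.4, 1: 0.5, 2: 0.1}  # marginal distribution of Y (HDFC)

# Independence: each joint probability is the product of the marginals.
joint = {(x, y): px[x] * py[y] for x, y in product(px, py)}


def distribution_of(func):
    """Distribution of func(X, Y), summing joint probabilities per value."""
    dist = defaultdict(float)
    for (x, y), p in joint.items():
        dist[func(x, y)] += p
    return dict(sorted(dist.items()))


def mean(dist):
    return sum(v * p for v, p in dist.items())


def variance(dist):
    return sum(v ** 2 * p for v, p in dist.items()) - mean(dist) ** 2


sum_dist = distribution_of(lambda x, y: x + y)
prod_dist = distribution_of(lambda x, y: x * y)

print({k: round(p, 2) for k, p in sum_dist.items()})
print(round(mean(sum_dist), 2), round(variance(sum_dist), 2))
print(round(mean(prod_dist), 2), round(mean(px) * mean(py), 2))
```

The last two printed lines compare the directly computed values with V(X) + V(Y) and E(X)E(Y), which is what parts g) and j) ask for.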

Examples based on Section 5.3

5.4 The probability distribution of a discrete random variable X is shown below, where X
represents the number of scooters owned by a family.

x    | 0    | 1    | 2    | 3
p(x) | 0.25 | 0.40 | 0.20 | 0.15

a) Find the following probabilities:
i. P(X > 1)
ii. P(X ≤ 2)
iii. P(1 ≤ X ≤ 2)
iv. P(0 < X < 1)
v. P(1 ≤ X < 3)

b) Find the expected value and standard deviation of X.

c) Apply the laws of expected value to find the following:
i. E(X²)
ii. E(2X² + 5)
iii. E[(X − 2)²]
d) Apply the laws of variance to find the following:
i. V(3X)
ii. V(3X − 2)
iii. V(3)
iv. V(3X) − 2
Solution:
a) i. 0.35, ii. 0.85, iii. 0.60, iv. 0.00, v. 0.60
b) E(X) = 1.25 scooters, σ = 0.9937 scooters
c) i. 2.55, ii. 10.1, iii. 1.55
d) i. 8.89, ii. 8.89, iii. 0, iv. 6.89

5.5 Determine which of the following are not valid probability distributions, and explain
why.

a)
x    | 0    | 1    | 2    | 3
p(x) | 0.15 | 0.25 | 0.35 | 0.45

b)
x    | 2     | 3    | 4    | 5
p(x) | -0.10 | 0.40 | 0.50 | 0.25

c)
x    | -2   | -1   | 0    | 1    | 2
p(x) | 0.10 | 0.20 | 0.40 | 0.20 | 0.10

Solution:
a. This is not a valid probability distribution because the probabilities do not sum to one.
b. This is not a valid probability distribution because it contains a negative probability.
c. This is a valid probability distribution.

5.6 Which of the following probability distributions are valid?

a)          | b)          | c)          | d)
x  | p(x)   | x  | p(x)   | x  | p(x)   | x  | p(x)
0  | 0.1    | 0  | 0.1    | 10 | 1.1    | 10 | 0.2
1  | 0.3    | 1  | 0.3    | 11 | 0.3    | 11 | 0.3
2  | 0.5    | 2  | -0.5   | 12 | 0.2    | 12 | 0.2
3  | 0.01   | 3  | 0.2    | 13 | 0.2    | 13 | 0.2
4  | 0.05   | 4  | 0.7    | 14 | 0.1    | 14 | 0.1

5.7 For the given probability distribution:

X   | p(x)
-10 | 0.05
-5  | 0.15
0   |
4   | 0.1
8   | 0.25
12  | 0.3

a) Complete the probability distribution.
b) What is the probability that X is at least two?
c) What is the probability of X being negative?
d) What is the probability of X being less than 15?

5.9 The joint probability distribution of variables X and Y is shown in the table below,
where X is the number of umbrellas and Y is the number of raincoats sold daily in a
small store.

      | X = 1 | X = 2 | X = 3
Y = 1 | 0.30  | 0.18  | 0.12
Y = 2 | 0.15  | 0.09  | 0.06
Y = 3 | 0.05  | 0.03  | 0.02

i. Calculate E(XY).
ii. Determine the marginal probability distributions of X and Y.
iii. Are X and Y independent? Explain.
iv. Calculate the expected values of X and Y.
v. Calculate the variances of X and Y.
vi. Find the probability distribution of the random variable X + Y.
vii. Calculate E(X + Y) and V(X + Y) directly by using the probability distribution of X + Y.
viii. Verify that V(X + Y) = V(X) + V(Y). Did you expect this result? Why?

Solution

i. 2.55

ii.
x    | 1   | 2   | 3
p(x) | 0.5 | 0.3 | 0.2

y    | 1   | 2   | 3
p(y) | 0.6 | 0.3 | 0.1

iii. Yes, because p(x, y) = p(x) × p(y) for all pairs (x, y).
iv. E(X) = 1.7 and E(Y) = 1.5
v. V(X) = 0.61 and V(Y) = 0.45
vi.
x + y    | 2    | 3    | 4    | 5    | 6
p(x + y) | 0.30 | 0.33 | 0.26 | 0.09 | 0.02

vii. E(X + Y) = 3.2, V(X + Y) = 1.06
viii. V(X) + V(Y) = 0.61 + 0.45 = 1.06 = V(X + Y). Yes, since X and Y are independent
random variables.

5.10 Let the random variable X represent the number of girls in a family with three
children.

a) Construct the probability distribution for X.

b) Find the mean and standard deviation of X.

Solution

a)
x (no. of girls) | 0     | 1     | 2     | 3
p(x)             | 0.125 | 0.375 | 0.375 | 0.125

b) Mean = 1.50; SD = 0.87

5.4 THE IMPORTANCE OF EXPECTED VALUE IN DECISION-MAKING
As discussed in the previous section, the expected value provides information about the
center of the probability distribution. Apart from giving the average (center), the expected
value combined with monetary payoffs is a very useful concept in economics, finance, etc.
To elaborate, let us look at two different types of applications of expected value.

Example 5.3: You are interested in investing your money in the stock market. Looking at
past records, you have developed the following probability distribution of the rate of return for a
stock.

Scenario  | Rate of return | Probability
Excellent | 10%            | 0.35
Good      | 5%             | 0.45
Bad       | -1%            | 0.20

a) Calculate the expected rate of return.

b) Calculate the variance and standard deviation for the rate of return.

Solution

x     | p(x) | x × p(x) | x² × p(x)
10    | 0.35 | 3.50     | 35.00
5     | 0.45 | 2.25     | 11.25
-1    | 0.20 | -0.20    | 0.20
Total | 1    | 5.55     | 46.45

a) $E(X) = \sum_{i=1}^{3} x_i \, P(x_i) = 5.55$

b) $V(X) = \sum_{i=1}^{3} x_i^2 \, P(x_i) - (E(X))^2 = 46.45 - (5.55)^2 = 15.65$

$SD(X) = \sqrt{V(X)} = 3.96$
The expected rate of return is 5.55% with a standard deviation of 3.96%.
As the expected gain is positive, you should invest in stocks.
Example 5.4: A newspaper vendor has a stall at New Delhi railway station and wants to know
the number of copies he should procure to satisfy the daily demand. He procures the
newspaper at Rs 3 and sells it at Rs 5. Any unsold newspaper is a loss for the vendor.
Based on his experience, the vendor has estimated the following probability distribution for
the number of copies demanded. How many copies should he procure to maximize his
profit?

No. of copies | Probability
50            | 0.07
60            | 0.11
70            | 0.33
80            | 0.26
90            | 0.19
100           | 0.04

Solution:
Profit per copy sold = selling price − procurement price = Rs 5 − Rs 3 = Rs 2

Loss per unsold copy = procurement price = Rs 3

Profit (Rs) for each combination of copies procured (rows) and copies demanded (columns):

No. of copies procured | Demand 50 | 60   | 70   | 80   | 90   | 100  | Expected profit
Probability            | 0.07      | 0.11 | 0.33 | 0.26 | 0.19 | 0.04 |
50                     | 100       | 100  | 100  | 100  | 100  | 100  | 100.0
60                     | 70        | 120  | 120  | 120  | 120  | 120  | 116.5
70                     | 40        | 90   | 140  | 140  | 140  | 140  | 127.5
80                     | 10        | 60   | 110  | 160  | 160  | 160  | 122.0
90                     | -20       | 30   | 80   | 130  | 180  | 180  | 103.5
100                    | -50       | 0    | 50   | 100  | 150  | 200  | 75.5

Let us see how we obtained the values in the table.

If the no. of copies procured is 50 and the demand is 50, then profit = 50 × 2 = Rs 100.

If the no. of copies procured is 50 and the demand is 60, we can satisfy the demand of
only 50 customers. The other 10 customers represent an opportunity loss, but there is no cost
involved, so the profit is 50 × 2 = Rs 100 whenever the demand is 50 or higher (see the entire 1st row).
Similarly, let us try to understand the second-row calculation.

If the no. of copies procured is 60 and the demand is 50, then 50 copies are sold at a profit and the
remaining 10 copies are a loss of Rs 3 each: 50 × 2 − 10 × 3 = Rs 70.

If the no. of copies procured is 60 and the demand is 60, then profit = 60 × 2 = Rs 120, and so on.
The expected profit in each row is the probability-weighted sum of its entries, e.g. for 70 copies:
0.07 × 40 + 0.11 × 90 + 0.82 × 140 = 127.5.
The maximum expected profit of Rs 127.5 is obtained when the vendor stocks 70 copies of the
newspaper.
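The payoff table can be generated from the profit rule described in the worked rows: Rs 2 earned on each copy sold and Rs 3 lost on each unsold copy. The following is a minimal sketch under that rule; the constant and function names are illustrative.

```python
# Probability distribution of daily demand, as estimated by the vendor.
demand_prob = {50: 0.07, 60: 0.11, 70: 0.33, 80: 0.26, 90: 0.19, 100: 0.04}

MARGIN = 2       # Rs 5 selling price minus Rs 3 procurement price
UNSOLD_LOSS = 3  # procurement price lost on each unsold copy


def profit(stock, demand):
    """Profit when `stock` copies are procured and `demand` copies are asked for."""
    sold = min(stock, demand)  # cannot sell more than is stocked or demanded
    return MARGIN * sold - UNSOLD_LOSS * (stock - sold)


def expected_profit(stock):
    """Probability-weighted average profit for a given stock level."""
    return sum(p * profit(stock, d) for d, p in demand_prob.items())


for stock in demand_prob:
    print(stock, round(expected_profit(stock), 1))

best = max(demand_prob, key=expected_profit)
print("best stock level:", best)
```

Restricting the candidate stock levels to the possible demand values is a simplifying assumption carried over from the table.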
Examples based on 5.4
5.11 The probability distribution of a random variable X is shown below, where X
represents the amount of money (in 1,000s) gained or lost in a particular game of
Rummy.

x    | -4   | 0    | 4    | 8
p(x) | 0.15 | 0.25 | 0.20 | 0.40

a) Find the following probabilities:
i. P(X ≤ 0)
ii. P(X > 3)
iii. P(0 ≤ X ≤ 4)
iv. P(X = 5)
b) Find the following values and indicate their units.
i. E(X)
ii. V(X)
iii. SD(X)
Solution:
a) i. 0.40, ii. 0.60, iii. 0.45, iv. 0.00

b) i. $3.40, ii. 19.64 dollars squared, iii. $4.43


5.12 A lottery seller sells 10,000 tickets. The cost of one ticket is Rs 10. The winner wins
a prize of Rs 10,000. What is the expected value of winning the prize?
5.13 A bag contains 20 items, of which 5 are defective. A sample of 10 items is selected at
random from this bag. If x represents the number of defective items among the 10 selected
items, describe the random variable x completely, and find its expectation.
5.14 A manager at a call center wants to estimate the effectiveness of a salesman. To
check his effectiveness, the manager estimates the average number of units the salesman sells
per sales call. He checks his records and comes up with the following probabilities. Is
the salesman effective?

Sales (units) | 0    | 1   | 3    | 4   | 5
Probability   | 0.05 | 0.1 | 0.15 | 0.5 | 0.2

5.15 A street vendor has found that the demand for his idlis varies every day. Since
an idli stays fresh for only one day, he has to discard the unsold idlis. From his
experience, the probability distribution of demand is given below. The cost of
preparing a plate is Rs 20 and he sells each plate for Rs 70. He has also noticed that, in
addition to the overstock cost, there is a cost of Rs 50 whenever a customer turns away due to
insufficient stock. He needs your help to know how many idlis he should stock so that
he can minimize his obsolescence loss and opportunity loss.

Demand      | 122  | 124  | 126  | 128 | 130
Probability | 0.35 | 0.15 | 0.25 | 0.2 | 0.05

5.5 BINOMIAL PROBABILITY DISTRIBUTION


In the coming sections, we shall discuss special cases of discrete probability distributions,
namely the Binomial probability distribution and the Poisson probability distribution. In this
section we shall focus on the Binomial probability distribution. 'Bi' means two, so one of the
important properties of the binomial probability distribution is that each trial of the random
experiment results in only two outcomes. The binomial distribution is an extension of
the Bernoulli distribution.
Let us discuss the Bernoulli distribution before we start with the Binomial distribution.

If a random experiment results in only two mutually exclusive outcomes, e.g. success and
failure, then the random variable follows a Bernoulli distribution. If the experiment is repeated
n times and the probability of success p (0 < p < 1) is constant in each trial, then such
trials are known as Bernoulli trials. The probability of success is denoted by p and the
probability of failure by q = 1 − p.

Table 5.9: Some examples of Bernoulli Distribution

Bernoulli Experiment | Possible Outcomes | Probability of Success
A student will pass or fail in the final exam | 1 = pass, 0 = fail | p = 0.7
A newborn child is either a girl or a boy | 1 = girl, 0 = boy | p = 0.5
A person is either a smoker or a non-smoker | 1 = non-smoker, 0 = smoker | p = 0.3
Throwing an unbiased die results in either an even number or an odd number | 1 = 2, 4, 6; 0 = 1, 3, 5 | p = 0.5

The probability distribution for a random variable X that follows the Bernoulli distribution with
probability p is written as:

$$P(X = x) = \begin{cases} p^{x}(1-p)^{1-x}, & \text{for } x = 0, 1 \\ 0, & \text{for all other values of } x \end{cases}$$

Here x assumes 0 for failure and 1 for success.

Conditions for Bernoulli Trials

1. The random experiment consists of a finite number of trials.

2. Each trial has exactly two outcomes: success or failure.

3. Trials are independent of each other.

4. The probability of success or failure is the same in each trial.

The expected value and variance of Bernoulli distribution are given below:

$$E(X) = 0 \times (1-p) + 1 \times p = p$$

$$V(X) = E(X^{2}) - (E(X))^{2} = (1^{2} \times p + 0^{2} \times (1-p)) - p^{2} = p - p^{2} = p(1-p)$$
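
The two results above can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name `bernoulli_moments` and the choice p = 0.7 are illustrative assumptions.

```python
def bernoulli_moments(p):
    """Mean and variance of a Bernoulli(p) variable from its two-point distribution."""
    pmf = {0: 1 - p, 1: p}                      # P(X=0) = q, P(X=1) = p
    mean = sum(x * prob for x, prob in pmf.items())
    second_moment = sum(x ** 2 * prob for x, prob in pmf.items())
    return mean, second_moment - mean ** 2      # V(X) = E(X^2) - (E(X))^2

# Illustrative value: p = 0.7 (hypothetical choice for the check)
mean, variance = bernoulli_moments(0.7)
print(mean, variance)
```

For p = 0.7 the mean should agree with p and the variance with p(1 - p) = 0.21, up to floating-point rounding.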

Table 5.5: Let's summarize the Bernoulli distribution

Parameters | p = probability of success
PDF | $P(X = x) = p(x) = p^{x} q^{1-x}$
Range | x = 0, 1
Expected value E(X) | p
Variance V(X) | pq
SD(X) | $\sqrt{pq}$

Example 5.5 A retailer feels that customers prefer credit cards over cash to purchase items
that cost more than Rs 1,000. The probability of buying an item using a credit card is 70% for
items worth Rs 1,000 or more. To justify his belief, he observes the purchasing pattern of
customers. Suppose 4 customers are standing in the queue to make payment.

a) Does this example satisfy the conditions of the Bernoulli process?

b) Construct the probability distribution.

c) Using the above probability distribution, construct the binomial probability distribution.

Solution:
Each customer uses either a credit card or cash to make the payment, so there are only two
possible outcomes (70% success if he uses a credit card, 30% failure if he uses cash as the mode
of payment). Each customer's mode of payment is independent of the others', and the probability of
success and failure is constant. Hence, the example satisfies the conditions of the Bernoulli
process.


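Customer 1 | Customer 2 | Customer 3 | Customer 4 | Event | x (customers using credit card) | Probability
S | S | S | S | SSSS | 4 | 0.2401
S | S | S | F | SSSF | 3 | 0.1029
S | S | F | S | SSFS | 3 | 0.1029
S | S | F | F | SSFF | 2 | 0.0441
S | F | S | S | SFSS | 3 | 0.1029
S | F | S | F | SFSF | 2 | 0.0441
S | F | F | S | SFFS | 2 | 0.0441
S | F | F | F | SFFF | 1 | 0.0189
F | S | S | S | FSSS | 3 | 0.1029
F | S | S | F | FSSF | 2 | 0.0441
F | S | F | S | FSFS | 2 | 0.0441
F | S | F | F | FSFF | 1 | 0.0189
F | F | S | S | FFSS | 2 | 0.0441
F | F | S | F | FFSF | 1 | 0.0189
F | F | F | S | FFFS | 1 | 0.0189
F | F | F | F | FFFF | 0 | 0.0081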

Our interest is not in identifying which individual customer uses a credit card but in
the number of customers who use a credit card, so we can combine the events that have the same
value of x.

x p(x)
0 0.0081

1 0.0756
2 0.2646
3 0.4116

4 0.2401

The above table represents a binomial distribution. In this way, a Bernoulli trial
repeated n times gives rise to a binomial distribution.
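
The enumeration in Example 5.5 can be reproduced with a short Python sketch. This is an illustration only, assuming 4 customers and p = 0.7 as in the table of events; the variable names are hypothetical.

```python
from itertools import product
from collections import defaultdict

p = 0.7   # probability that a customer pays by credit card (success)
n = 4     # number of customers in the queue, as in the table of events

# List every sequence of S (credit card) and F (cash), then group by x.
distribution = defaultdict(float)
for outcome in product("SF", repeat=n):
    x = outcome.count("S")                      # customers using a credit card
    distribution[x] += p ** x * (1 - p) ** (n - x)

for x in sorted(distribution):
    print(x, round(distribution[x], 4))
```

Grouping the 16 events by x gives the same five probabilities as the combined table, and they add up to 1.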

Binomial Distribution is an extension of Bernoulli distribution

When a random experiment yields only two outcomes that are mutually exclusive and
collectively exhaustive, and the experiment is repeated n times independently, the
random variable that counts the successes follows the binomial probability distribution. The probability of success is
denoted by p and the probability of failure by q = 1 - p. Suppose that the experiment
is repeated n times and we get success x times and failure the remaining n - x times.
Since we get x successes out of n trials, the total number of ways in which the x
successes can occur is $^{n}C_{x}$.
The probability distribution for a random variable X that follows the Binomial distribution with
parameters n and p is written as:

$$P(X = x) = p(x) = {}^{n}C_{x}\, p^{x} q^{n-x}, \quad x = 0, 1, 2, \ldots, n$$


Conditions for Binomial Distribution:

1. Each trial results in only two outcomes which are mutually exclusive and collectively
exhaustive.
2. The number of trials ‘n’ is finite.

3. The trials are independent of each other.

4. The probability of success remains constant for each trial.

The binomial distribution is widely used in problem solving. Some examples where we
can apply the binomial distribution are: whether an item produced in a manufacturing plant is
defective or non-defective, whether a firm will obtain a tender or not, whether a
potential customer will buy a product or not, whether a student will pass an exam or
not, whether a candidate will obtain a job or not, etc.

The formulas shared in Section 5.3 are used to calculate the expected value, variance and
standard deviation for a binomial distribution. The formulas simplify to

$$E(X) = np$$

$$V(X) = npq$$

$$SD(X) = \sqrt{npq}$$

For the binomial distribution, variance < mean, because npq < np whenever q < 1. If np is a whole
number then the distribution is unimodal, with mean = mode = np.
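
As a check on these simplified formulas, the mean and variance can also be computed directly from the probability distribution. The sketch below is illustrative; the helper names `binomial_pmf` and `binomial_summary` and the values n = 10, p = 0.5 are assumptions, not taken from the text.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) = nCx * p^x * q^(n-x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_summary(n, p):
    """Mean, variance and SD computed from the full distribution."""
    pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
    mean = sum(x * px for x, px in enumerate(pmf))
    variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))
    return mean, variance, sqrt(variance)

# Illustrative parameters: n = 10, p = 0.5
mean, variance, sd = binomial_summary(10, 0.5)
print(mean, variance, sd)
```

The values obtained this way should agree with np = 5 and npq = 2.5 up to rounding.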
Let us examine the binomial distribution graphically when its parameters n and p change.


Fig. 5.3: Shape of the binomial distribution for different n & p

[Figure: nine bar charts of the binomial distribution, one for each combination of n = 5, 10, 20 and p = 0.1, 0.5, 0.9.]

From the graph, we can see that for the same number of trials n, the shape of the distribution
changes with the probability p. If p < 0.5 the distribution is right-skewed,
for p = 0.5 it is symmetric, and for p > 0.5 it is left-skewed.
The most probable values of x lie close to the expected value. For a given n, the
variance is highest when p and q are equal.
Similarly, if p is held constant and n increases, the shape of the distribution
approaches symmetry.

Table 5.10: Let’s summarize the binomial distribution

Parameters | n = number of trials; p = probability of success
PDF | $P(X = x) = p(x) = {}^{n}C_{x}\, p^{x} q^{n-x}$
Range | x = 0, 1, 2, ..., n
Expected value E(X) | np
Variance V(X) | npq
Shape | If p < 0.5 the distribution is right-skewed; if p = 0.5 it is symmetric; if p > 0.5 it is left-skewed
Excel function | BINOM.DIST(x, n, p, 0)

Example 5.6: In an industry, there is a 30% chance that a worker suffers an injury due to a chemical leak.
a) Construct the probability distribution.

b) What is the probability that out of 20 workers, 8 or more will suffer an injury due to
a chemical leak?
c) What are the mean and variance of the number of injured workers?

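Solution:
Here n = 20, p = 0.30.

a) $P(X = x) = {}^{20}C_{x} (0.3)^{x} (0.7)^{20-x}$

$$P(X = 0) = {}^{20}C_{0} (0.3)^{0} (0.7)^{20} = 1 \times 1 \times 0.000798 = 0.000798$$

$$P(X = 1) = {}^{20}C_{1} (0.3)^{1} (0.7)^{19} = 20 \times 0.3 \times 0.00114 = 0.006839$$

Similarly, substituting the remaining values of x, we get the following table: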

X P(X)
0 0.00079792266
1 0.00683933711
2 0.02784587252
3 0.07160367221
4 0.13042097437
5 0.17886305057
6 0.19163898275
7 0.16426198522
8 0.11439673970
9 0.06536956555
10 0.03081708090
11 0.01200665490
12 0.00385928193
13 0.00101783260
14 0.00021810699
15 0.00003738977
16 0.00000500756
17 0.00000050496
18 0.00000003607
19 0.00000000163
20 0.00000000003

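b) $$P(X \ge 8) = 1 - [P(0) + P(1) + P(2) + \ldots + P(7)]$$

$$= 1 - \left[{}^{20}C_{0} (0.3)^{0} (0.7)^{20} + {}^{20}C_{1} (0.3)^{1} (0.7)^{19} + \ldots + {}^{20}C_{7} (0.3)^{7} (0.7)^{13}\right]$$

$$= 1 - \left[\frac{20!}{0!\,20!} \times 0.3^{0} \times 0.7^{20} + \frac{20!}{1!\,19!} \times 0.3^{1} \times 0.7^{19} + \frac{20!}{2!\,18!} \times 0.3^{2} \times 0.7^{18} + \ldots + \frac{20!}{7!\,13!} \times 0.3^{7} \times 0.7^{13}\right]$$

$$= 1 - 0.772272 = 0.227728$$

c) E(X) = np = 6, V(X) = npq = 4.2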

The probability of 8 or more workers suffering an injury due to a chemical leak is 0.227728, and
on average 6 workers suffer an injury due to a chemical leak, with a variance of 4.2.
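
The calculations in Example 5.6 can be reproduced with a few lines of Python. This is a sketch for checking the arithmetic rather than part of the original solution; the variable names are illustrative.

```python
from math import comb

n, p = 20, 0.30   # number of workers and probability of injury, from Example 5.6

# Full probability distribution P(X = x) for x = 0, 1, ..., 20
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

# P(X >= 8) by the complement rule: 1 - [P(0) + ... + P(7)]
prob_8_or_more = 1 - sum(pmf[:8])

mean = n * p                  # E(X) = np
variance = n * p * (1 - p)    # V(X) = npq

print(round(prob_8_or_more, 6), mean, variance)
```

Using the complement rule keeps the sum short: only the eight terms P(0) to P(7) are needed instead of the thirteen terms from P(8) to P(20).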

Using Binomial Tables


As the above example shows, calculating probabilities using the binomial formula becomes
time-consuming as the sample size increases. To overcome this issue, we have the option
of using binomial tables or Excel. The details of the Excel formulas are shared
towards the end of this section. Let us now solve a similar problem using the binomial table.
The binomial table is shared towards the end of the book, in Appendix Table 1.

To understand how we use Binomial Tables let us take an example.

In an R & D lab, there is a 19% chance of a radiation leak. What is the probability that out of 20
workers, 10 or fewer will suffer an injury due to a radiation leak?

Here n = 20, p = 0.19 and x = 0, 1, 2, ..., 10.

As the number of trials is 20, look for the table corresponding to n = 20. Go across the columns
until you locate the value p = 0.19.

These are the values that we find corresponding to n = 20, p = 0.19:


x | P(x)
0 | 0.0148
1 | 0.0693
2 | 0.1545
3 | 0.2175
4 | 0.2168
5 | 0.1627
6 | 0.0954
7 | 0.0448
8 | 0.0171
9 | 0.0053
10 | 0.0014
11 | 0.0003
12 | 0.0001
13 | 0.0000
14 | ---
15 | ---
16 | ---
17 | ---
18 | ---
19 | ---
20 | ---

Adding the values from x = 0 to x = 10 answers the above question: P(X <= 10) is approximately 0.9996.

Using Excel to solve problems of a binomial distribution


On average, 11 percent of married couples file for divorce in the Maldives. In a random
sample of 6 married couples, what is the probability that one couple will file for divorce?
Here n=6, p=0.11

$$P(X = x) = {}^{6}C_{x} (0.11)^{x} (0.89)^{6-x}, \quad x = 0, 1, 2, 3, 4, 5, 6$$


Binomial Function in Excel (Formulas > More functions > Statistical > BINOM.DIST)

The function requires 4 arguments to be filled in the dialog box shown below in Fig. 5.4


Fig. 5.4: Binomial Distribution to calculate PDF

[Screenshot: Excel Function Arguments dialog for BINOM.DIST with Number_s = 1, Trials = 6 and Cumulative set to FALSE.]

Fig. 5.5: Binomial Distribution to calculate CDF

[Screenshot: Excel Function Arguments dialog for BINOM.DIST with Number_s = 1, Trials = 6 and Cumulative set to TRUE; formula result = 0.865529215.]

Number_s: the number of successes in n trials, i.e. x. We enter 1.

Trials: the sample size, i.e. n. We enter 6.

Probability_s: the probability of success in each trial, i.e. p. We enter 0.11.

Cumulative: a logical value. If we enter 0 or FALSE, it calculates P(X = 1), which is the PDF.
If we enter 1 or TRUE, it calculates P(X <= 1), which is the CDF.

Excel formula: for PDF BINOM.DIST(x, n, p, FALSE)

for CDF BINOM.DIST(x, n, p, TRUE)

Using Excel we get,


Table 5.7: Excel formulas to calculate PDF of Binomial distribution

X | PDF P(X) | Excel formula | What the formula calculates
0 | 0.496981 | = BINOM.DIST(0, 6, 0.11, FALSE) | P(X = 0)
1 | 0.368548 | = BINOM.DIST(1, 6, 0.11, FALSE) | P(X = 1)
2 | 0.113877 | = BINOM.DIST(2, 6, 0.11, FALSE) | P(X = 2)
3 | 0.018766 | = BINOM.DIST(3, 6, 0.11, FALSE) | P(X = 3)
4 | 0.00174 | = BINOM.DIST(4, 6, 0.11, FALSE) | P(X = 4)
5 | 8.6E-05 | = BINOM.DIST(5, 6, 0.11, FALSE) | P(X = 5)
6 | 1.77E-06 | = BINOM.DIST(6, 6, 0.11, FALSE) | P(X = 6)

Table 5.8: Excel formulas to calculate CDF of Binomial distribution

X | CDF F(X) | Excel formula | What the formula calculates
0 | 0.496981 | = BINOM.DIST(0, 6, 0.11, TRUE) | P(X = 0)
1 | 0.865529 | = BINOM.DIST(1, 6, 0.11, TRUE) | P(X <= 1) = P(0) + P(1)
2 | 0.979406 | = BINOM.DIST(2, 6, 0.11, TRUE) | P(X <= 2) = P(0) + P(1) + P(2)
3 | 0.998173 | = BINOM.DIST(3, 6, 0.11, TRUE) | P(X <= 3) = P(0) + P(1) + P(2) + P(3)
4 | 0.999912 | = BINOM.DIST(4, 6, 0.11, TRUE) | P(X <= 4) = P(0) + P(1) + ... + P(4)
5 | 0.999998 | = BINOM.DIST(5, 6, 0.11, TRUE) | P(X <= 5) = P(0) + P(1) + ... + P(5)
6 | 1 | = BINOM.DIST(6, 6, 0.11, TRUE) | P(X <= 6) = P(0) + P(1) + ... + P(6)

Fig. 5.6: Graph of PDF & CDF of the data

[Figure: two charts over x = 0 to 6, one of the PDF P(X) and one of the CDF F(X).]
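
For readers without Excel, the PDF and CDF values for n = 6, p = 0.11 can be reproduced in Python. The function `binom_dist` below is a hypothetical helper written to mirror the four arguments of BINOM.DIST for whole-number x; it is a sketch, not Excel's own implementation.

```python
from math import comb

def binom_dist(x, n, p, cumulative):
    """Mimic BINOM.DIST(x, n, p, cumulative) for a whole-number x."""
    def pmf(k):
        return comb(n, k) * p ** k * (1 - p) ** (n - k)
    if cumulative:
        return sum(pmf(k) for k in range(x + 1))   # CDF: P(X <= x)
    return pmf(x)                                  # PDF: P(X = x)

# Married couples example: n = 6, p = 0.11
for x in range(7):
    print(x, round(binom_dist(x, 6, 0.11, False), 6),
          round(binom_dist(x, 6, 0.11, True), 6))
```

The probability that exactly one couple files for divorce is the PDF value at x = 1, and the CDF value at x = 1 is the probability of at most one.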

Examples based on Section 5.5


5.16 For the parameters n=10, p=0.3, using the formula of binomial distribution estimate
the following probabilities
a) P(x=2)
b) P(x>=4)
c) P(x<=12)

d) P(x<5)
5.17 For the parameters n=20, p=0.2, using the Binomial table estimate the following
probabilities.

a) P(x=6)
b) P(x>=12)
c) P(x<=10)
d) P(x>8)
5.18 Find the mean and standard deviation for the following binomial distributions.
a) n=10, p=0.2

b) n=50, p=0.45

c) n=82, p=0.06
d) n=300, p=0.25

5.19 For n= 14, compute the probabilities of x>=3 for following values of p.
a) p=0.15

b) p=0.25
c) p=0.35

d) p=0.45
5.20 In the Holiday Inn hotel, 40 percent of the customers pay by credit card.
a) Of the next 15 customers, what is the probability that all of them pay by cash?

b) What is the probability that fewer than 5 pay by credit card?

c) Find the expected number of customers who make payment using a credit card.

d) Find the standard deviation.
e) Construct the probability distribution function.

f) Plot the graph and describe the shape.

a) What is the probability that at most 8 women executives go on for 7 days of vacation?

5.6 POISSON DISTRIBUTION


Another important discrete probability distribution is the Poisson probability distribution,
named after the French mathematician Simeon-Denis Poisson (1781-1840). The Poisson
probability distribution is relevant to business problems in which there are few successes
against a large number of failures or vice versa, for example, the number of accidents in a city per
week, the number of customers arriving in a bank per minute, etc. As the probability p of a
particular event happening is very low, the Poisson distribution is also referred to as the
distribution of rare events or the law of improbable events. The Poisson process measures
the number of occurrences of a specific outcome of a discrete random variable in a fixed
interval of time, space or volume for which the average number of occurrences of the outcome is
known or can be estimated.
So the difference between the Binomial and Poisson distributions is: a Binomial random variable
counts the number of successes in a fixed number of Bernoulli trials, whereas a Poisson random variable
counts the number of successes over a fixed interval of time or space.
A rule of thumb says that, for the Poisson approximation to the binomial distribution to be good:

“The sample size n should be equal to or larger than 20, i.e. (n ≥ 20), and the probability of a
single success, p, should be smaller than or equal to 0.05, i.e. (p ≤ 0.05). If n > 100 or np < 10,
then the approximation is excellent.”
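
To see how close the approximation is, the binomial probabilities can be compared with Poisson probabilities that use the same mean, λ = np. The Python sketch below uses the hypothetical values n = 100 and p = 0.03, chosen only because they satisfy the rule of thumb; the helper names are illustrative.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    return exp(-lam) * lam ** x / factorial(x)

# Hypothetical values that satisfy n >= 20 and p <= 0.05
n, p = 100, 0.03
lam = n * p          # the Poisson mean is matched to the binomial mean np

# Largest difference between the two distributions over x = 0, ..., 20
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam))
              for x in range(21))

for x in range(6):
    print(x, round(binomial_pmf(x, n, p), 4), round(poisson_pmf(x, lam), 4))
print("largest gap:", round(max_gap, 4))
```

For these values the two columns differ only in the third decimal place, which is why the simpler Poisson formula is often used in place of the binomial one when n is large and p is small.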

Conditions for Poisson Probability Distribution

To apply the Poisson distribution, the random variable should satisfy the following conditions:
1. The number of successes within a specified time or space interval equals any integer
between zero and infinity.
2. The numbers of successes counted in non-overlapping intervals occur randomly and
are independent of each other.

3. The probability that success occurs in any interval is the same for all intervals of equal
size and is proportional to the size of the interval.

4. The average number of occurrences is constant for all time intervals of the same size.
Examples of Poisson Distribution

• The number of cars that cross the Bandra-Worli Sea Link between 9 AM and 12 AM
during weekdays.
• The number of patients waiting at the OPD per hour.

• The number of organisms per unit volume present in any liquid.

• The number of leaks present in a single stretch of a pipeline.

The first two random variables follow a Poisson distribution over a specific interval of time, and the
third and fourth random variables follow a Poisson distribution over space.
The probability density function for Poisson Distribution

For the Poisson random variable X, the probability of x successes over a given interval of time
or space is given by

$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$

where λ is the mean number of successes and e = 2.718 is the base of the natural logarithm.

Expected Value, Variance and Standard deviation of a Poisson random variable

$$E(X) = \mu = \lambda$$

$$V(X) = \sigma^{2} = \lambda$$

$$SD(X) = \sigma = \sqrt{\lambda}$$

For the Poisson distribution, mean = variance. The simplicity of the Poisson formula makes it an
attractive model compared to the binomial. The Poisson distribution is right-skewed, but
as the value of λ increases, it moves towards symmetry.

Fig. 5.7: Shape of Poisson Distributions

[Figure: six bar charts of the Poisson distribution for increasing values of λ, from λ = 0.1 to λ = 3.5.]

Table 5.9: Let’s summarize the Poisson distribution

Parameters | λ = mean arrival per unit of time or space
PDF | $P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$
Range | x = 0, 1, 2, ..., ∞
Expected value E(X) | λ
Variance V(X) | λ
SD(X) | $\sqrt{\lambda}$
Shape | The shape of the distribution is right-skewed
Excel function | POISSON.DIST(x, λ, 0)
Example 5.7 Traffic police are planning to make some roads one-way to avoid accidents.
The records show that on average there are 4 accidents per week at a particular intersection
on the Kapasera border.

a) Construct the probability distribution.


b) Find the probability of 0 accidents in a week.

c) Find the probability of at least 4 accidents per week.


d) Find the probability of at most 3 accidents per week.
Solution:
a) Here λ = 4

$$P(X = x) = \frac{e^{-4}\, 4^{x}}{x!}$$

$$P(X = 0) = \frac{e^{-4}\, 4^{0}}{0!} = e^{-4} = 0.0183$$

$$P(X = 1) = \frac{e^{-4}\, 4^{1}}{1!} = e^{-4} \times 4 = 0.0733$$

Similarly, substitute successive values of x until P(x) becomes negligible, as shown in the table
below:
X P(X)
0 0.018316
1 0.073263
2 0.146525
3 0.195367
4 0.195367
5 0.156293
6 0.104196
7 0.05954
8 0.02977
9 0.013231
10 0.005292
11 0.001925
12 0.000642
13 0.000197

b) P(X = 0) = 0.018316
c) P(X ≥ 4) = 1 - P(X ≤ 3) = 0.56653
d) P(X ≤ 3) = 0.43347
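
The three answers can be verified with a short Python sketch. It is an illustration only; the helper name `poisson_pmf` is an assumption, and part (c) is computed with the complement rule so that the infinite tail does not have to be summed.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam ** x / factorial(x)

lam = 4   # average number of accidents per week, from Example 5.7

p_zero = poisson_pmf(0, lam)                            # part (b)
at_most_3 = sum(poisson_pmf(x, lam) for x in range(4))  # part (d): P(X <= 3)
at_least_4 = 1 - at_most_3                              # part (c): P(X >= 4)

print(round(p_zero, 6), round(at_least_4, 5), round(at_most_3, 5))
```

Because "at least 4" and "at most 3" are complementary events, the last two printed values add up to 1.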

Using Poisson Table


Just as binomial tables can be used to answer questions on the binomial distribution, in the
same way one can use the Poisson distribution table to solve examples of the Poisson
distribution. Let us now solve one problem using the Poisson table. The Poisson table is
shared towards the end of the book in Appendix Table 2.

Let us look at an example. Suppose the average number of customers walking into a bank for
some service is 3 per hour.

a) Prepare the probability distribution.


b) What is the probability that less than 4 customers are standing in a queue to be
served?

Solution:

a) From the Poisson probability distribution table, Appendix Table 2, we get the following
values:

x | P(X)
0 | 0.0498
1 | 0.1494
2 | 0.2240
3 | 0.2240
4 | 0.1680
5 | 0.1008
6 | 0.0504
7 | 0.0216
8 | 0.0081
9 | 0.0027
10 | 0.0008
11 | 0.0002
12 | 0.0001
13 | 0.0000

b) P(X < 4) = 0.6472


Using Excel to solve problems of Poisson distribution
Let us solve the above problem with Excel now

Here λ = 3 per hour

$$P(X = x) = \frac{e^{-3}\, 3^{x}}{x!}$$

Poisson Function in Excel (Formulas > More functions > Statistical > POISSON.DIST)
The function requires 3 arguments to be filled in the dialog box shown below in Fig. 5.8


Fig. 5.8: Excel formula to calculate the Poisson Distribution PDF

[Screenshot: Excel Function Arguments dialog for POISSON.DIST with Mean = 3 and Cumulative set to FALSE.]

Fig. 5.9: Excel formula to calculate Poisson Distribution CDF

[Screenshot: Excel Function Arguments dialog for POISSON.DIST with Cumulative set to TRUE.]

X: the number of successes, i.e. x. We enter 4.

Mean: the average number of successes, i.e. λ. We enter 3.

Cumulative: a logical value. If we enter 0 or FALSE, it calculates P(X = 4), which is the PDF.
If we enter 1 or TRUE, it calculates P(X <= 4), which is the CDF.

Excel formula: for PDF POISSON.DIST(x, λ, FALSE)

for CDF POISSON.DIST(x, λ, TRUE)


Using Excel we get

Table 5.9: Excel formulas to calculate PDF of Poisson distribution

X | PDF P(X) | Excel formula | What the formula calculates
0 | 0.0498 | POISSON.DIST(0, 3, FALSE) | P(X = 0)
1 | 0.1494 | POISSON.DIST(1, 3, FALSE) | P(X = 1)
2 | 0.2240 | POISSON.DIST(2, 3, FALSE) | P(X = 2)
3 | 0.2240 | POISSON.DIST(3, 3, FALSE) | P(X = 3)
4 | 0.1680 | POISSON.DIST(4, 3, FALSE) | P(X = 4)
5 | 0.1008 | POISSON.DIST(5, 3, FALSE) | P(X = 5)
6 | 0.0504 | POISSON.DIST(6, 3, FALSE) | P(X = 6)
7 | 0.0216 | POISSON.DIST(7, 3, FALSE) | P(X = 7)
8 | 0.0081 | POISSON.DIST(8, 3, FALSE) | P(X = 8)
9 | 0.0027 | POISSON.DIST(9, 3, FALSE) | P(X = 9)
10 | 0.0008 | POISSON.DIST(10, 3, FALSE) | P(X = 10)
11 | 0.0002 | POISSON.DIST(11, 3, FALSE) | P(X = 11)
12 | 0.0001 | POISSON.DIST(12, 3, FALSE) | P(X = 12)
13 | 0.0000 | POISSON.DIST(13, 3, FALSE) | P(X = 13)

Table 5.10: Excel formulas to calculate CDF of Poisson distribution

X | CDF F(X) | Excel formula | What the formula calculates
0 | 0.0498 | POISSON.DIST(0, 3, TRUE) | P(X = 0)
1 | 0.1991 | POISSON.DIST(1, 3, TRUE) | P(X <= 1) = P(0) + P(1)
2 | 0.4232 | POISSON.DIST(2, 3, TRUE) | P(X <= 2) = P(0) + P(1) + P(2)
3 | 0.6472 | POISSON.DIST(3, 3, TRUE) | P(X <= 3) = P(0) + P(1) + P(2) + P(3)
4 | 0.8153 | POISSON.DIST(4, 3, TRUE) | P(X <= 4) = P(0) + P(1) + ... + P(4)
5 | 0.9161 | POISSON.DIST(5, 3, TRUE) | P(X <= 5) = P(0) + P(1) + ... + P(5)
6 | 0.9665 | POISSON.DIST(6, 3, TRUE) | P(X <= 6) = P(0) + P(1) + ... + P(6)
7 | 0.9881 | POISSON.DIST(7, 3, TRUE) | P(X <= 7) = P(0) + P(1) + ... + P(7)
8 | 0.9962 | POISSON.DIST(8, 3, TRUE) | P(X <= 8) = P(0) + P(1) + ... + P(8)
9 | 0.9989 | POISSON.DIST(9, 3, TRUE) | P(X <= 9) = P(0) + P(1) + ... + P(9)
10 | 0.9997 | POISSON.DIST(10, 3, TRUE) | P(X <= 10) = P(0) + P(1) + ... + P(10)
11 | 0.9999 | POISSON.DIST(11, 3, TRUE) | P(X <= 11) = P(0) + P(1) + ... + P(11)
12 | 1.0000 | POISSON.DIST(12, 3, TRUE) | P(X <= 12) = P(0) + P(1) + ... + P(12)
13 | 1.0000 | POISSON.DIST(13, 3, TRUE) | P(X <= 13) = P(0) + P(1) + ... + P(13)

[Figure: two charts over x = 0 to 13, one of the Poisson PDF P(X) and one of the CDF.]
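
The PDF and CDF values for the bank example (λ = 3 per hour) can also be generated in Python. The function `poisson_dist` is a hypothetical helper that mirrors the three arguments of POISSON.DIST for whole-number x; it is a sketch, not Excel's implementation.

```python
from math import exp, factorial

def poisson_dist(x, mean, cumulative):
    """Mimic POISSON.DIST(x, mean, cumulative) for a whole-number x."""
    def pmf(k):
        return exp(-mean) * mean ** k / factorial(k)
    if cumulative:
        return sum(pmf(k) for k in range(x + 1))   # CDF: P(X <= x)
    return pmf(x)                                  # PDF: P(X = x)

# Bank example: on average 3 customers per hour
for x in range(14):
    print(x, round(poisson_dist(x, 3, False), 4),
          round(poisson_dist(x, 3, True), 4))
```

The probability of fewer than 4 customers is the CDF value at x = 3, since "fewer than 4" means 0, 1, 2 or 3 customers.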

Examples based on Sec 5.6


5.21 Assume that X is a Poisson random variable with λ = 1.5. Calculate the following
probabilities:
a) P(X = 1)

b) P(X ≥ 4)
c) P(X ≤ 2)
d) P(X > 3.5)
e) P(1 < X < 4)
5.22 Given a binomial distribution with n = 30 and p = 0.015, use a Poisson approximation to
the binomial to find
a) P(X > 5)

b) P(X < 8)
c) P(2 < X < 9)
d) P(X ≥ 3)

5.23 A banker is interested to know how many credit card applications are processed at his
bank. On average, 5 credit card applications are processed in a week. What is the
probability that at most 8 credit card applications would be processed in a fortnight?
5.24 An experienced waiter at Hyatt Hotel has a 0.3% chance of making an error while
taking an order. If he takes 500 orders, find the probability of

a) At least 3 errors
b) Fewer than 6 errors

c) Which distribution is preferable, binomial or Poisson? Explain.

5.25 The chance of a misprint on a page is three percent. If the book has 200 pages:

a) What is the expected number of misprints?

b) What is the standard deviation of misprints?

c) What is the probability of at most 10 misprints in the book?

d) What is the probability of at least 5 misprints in the book?

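5.7 LET US SUM UP

Expected value (mean): $\mu = E(X) = \sum_{i=1}^{n} x_{i} P(x_{i})$

Variance: $V(X) = \sum_{i=1}^{n} (x_{i} - E(X))^{2} P(x_{i})$ or $V(X) = \sum_{i=1}^{n} x_{i}^{2} P(x_{i}) - \mu^{2}$

Standard deviation: $SD(X) = \sqrt{V(X)}$

PDF of binomial distribution: $P(X = x) = p(x) = {}^{n}C_{x}\, p^{x} q^{n-x}, \quad x = 0, 1, 2, \ldots, n$

Mean of binomial distribution: np

Standard deviation of binomial distribution: $SD(X) = \sqrt{npq}$

PDF of Poisson distribution: $P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$

Mean of Poisson distribution: λ

Standard deviation of Poisson distribution: $SD(X) = \sqrt{\lambda}$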

5.8 KEYWORDS
Bernoulli Process: A process that results in two outcomes, probability of success remains
constant and trials are independent of each other.

Binomial Distribution: A Bernoulli experiment when repeated n times results in a binomial
distribution.

Continuous Random Variable: A random variable that can take any value within a specified
range.

Discrete Probability Distribution: A discrete distribution describes the probabilities of
occurrence of each value of a discrete random variable.
Discrete Random Variable: A random variable that can take only countable values.
Expected Value: A weighted average of the outcome of an experiment.

Poisson Distribution: A discrete probability distribution in which the probability of
occurrence of an outcome within a very small interval of time is very small, and the
probability that two or more such outcomes will occur within the same small time interval is
negligible. The occurrence of an outcome in one time period is independent of its occurrence
in any other.
Probability Distribution: A list of all possible outcomes of random experiment with their
associated probability is defined as a probability distribution.
Random Variable: A variable that takes different values as a result of outcomes of a random
experiment.

5.9 CASE: DECISION ON THE COMPENSATION PACKAGE


The CEO of a large multinational company, who has a degree in analytics, thought of a policy of
offering executives the option of choosing their salary plan. By choosing their salary plans, the
executives are responsible for their own compensation. He thinks of a model where one option is
to give a large salary and a smaller bonus, and the other is to give a slightly lower salary and a
larger bonus. He knows that the bonus is very appealing to the executives as they feel valued in
the company. The newly joined executives are given an option to choose their salary bracket
from Rs 1,50,000 to Rs 2,00,000. By choosing a lower salary, the executive gets a chance to
earn a large incentive in the form of a bonus. If the company is not able to generate any profit in
that year, then the company is not liable to give any bonus. The company hired two
executives, Akshay and Mahesh. Each is given the option to choose one of the plans.

Option 1: a base pay of Rs 1,50,000 with a possibility of a large bonus.

Option 2: a base pay of Rs 2,00,000 with a possibility of half the bonus under Option 1.

Akshay & Mahesh have different types of responsibilities. From secondary data, they figured
out the probability of a profit by the company. Thinking about their risk, both construct their
probability distributions concerning bonus outcomes shown in the table below.


Table 5.11:

Bonus Probability

(Rs) Akshay Mahesh

0 0.35 0.20

50,000 0.45 0.25

1,00,000 0.10 0.35

1,50,000 0.10 0.20

Summarize the payment plan concerning each executive's probability distribution.

a) Compute the expected values to evaluate payment plans for Akshay and Mahesh.
b) Help Akshay and Mahesh to decide whether to choose Option 1 or Option 2 for
their salary.

5.10 SELF-ASSESSMENT QUESTIONS

Q.1 For each of the following random variables, indicate whether the variable is discrete or
continuous and specify the possible values that it can assume.
a. X = the number of wrong calls received on a given day.

b. X = the amount of money lost in a month by a randomly selected gambler.
c. X = the average number of customers in a shop in an hour.

d. X = the number of customers, out of 10 surveyed, who own an Audi.

e. X = the time in minutes required to get served in the passport office.

Solution:
a) discrete; x = 0, 1, 2, 3, ... b) continuous; -∞ < x < ∞

c) continuous; x > 0, d) discrete; x = 0, 1, 2, ..., 10,

e) continuous; x > 0
Q.2 The random variable X represents the number of farms per family in a rural area of
Punjab, with the probability distribution: p(x) = 0.05x, x = 2, 3, 4, 5, or 6.

a) Express the probability distribution in tabular form.

b) Find the expected number of farms per family, and find the variance and standard
deviation of X.
c) Find the following probabilities:

i. P(X ≥ 4)
ii. P(X > 4)
iii. P(3 ≤ X ≤ 5)
iv. P(2 < X < 4)

v. P(X = 4.5)
Solution:

a)
X | 2 3 4 5 6

p(x) | 0.10 0.15 0.20 0.25 0.30

b) E(X) = 4.5, σ² = 1.75 and σ = 1.323

c) i. 0.75 ii. 0.55 iii. 0.60 iv. 0.15 v. 0.00

Q.3 Let X represent the number of times a student visits a club in one month. Assume
that the probability distribution of X is as follows:

X | 0 1 2 3

p(x) | 0.05 0.25 0.50 0.20

a) Find the mean μ and the standard deviation σ of this distribution.

b) Find the mean and the standard deviation of Y = 2X - 1.

c) What is the probability that the student visits the club at least once in a month?

d) What is the probability that the student visits the club no more than twice in a
month?
Solution:

a) μ_X = 1.85 and σ_X = 0.792

b) μ_Y = 2.70 and σ_Y = 1.584
c) P(1) + P(2) + P(3) = 0.95

d) P(0) + P(1) + P(2) = 0.80
Q.4 At DLF Mall the probability distribution of the number of stores shoppers enter is
shown in the table below:
x | 0 1 2 3 4
p(x) | 0.05 0.35 0.25 0.20 0.15
a) Find the expected value of the number of stores entered, and find the variance and
standard deviation of the number of stores entered.

b) Suppose Y = 2X + 1 for each value of X. What is the probability distribution of Y?

c) Calculate the expected value of Y directly from the probability distribution of Y.

d) Use the laws of expected value to calculate the mean of Y from the probability
distribution of X.
e) Calculate the variance and standard deviation of Y directly from the probability
distribution of Y.
f) Use the laws of variance to calculate the variance and standard deviation of Y from the
probability distribution of X.
g) What did you notice about the mean, variance, and standard deviation of Y = 2X + 1 in
terms of the mean, variance, and standard deviation of X?

Solution:
a) E(X) = 2.05, V(X) = 1.3475, SD(X) = 1.1608

b)
Y | 1 3 5 7 9

P(y) | 0.05 0.35 0.25 0.20 0.15

c) E(Y) = 5.10
d) E(Y) = E(2X + 1) = 2E(X) + 1 = 2(2.05) + 1 = 5.10

e) $\sigma_{Y}^{2} = 5.39$ and $\sigma_{Y} = 2.3216$

f) V(Y) = V(2X + 1) = 4V(X) = 4(1.3475) = 5.39; SD(Y) = $\sqrt{V(Y)}$ = $\sqrt{5.39}$ = 2.32

g) E(Y) = 2E(X) + 1, V(Y) = 4V(X), and SD(Y) = 2SD(X).
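
The laws of expected value and variance used in this solution can be confirmed numerically. The Python sketch below builds the distribution of Y = 2X + 1 from the distribution of X given in Q.4; the helper names `mean` and `variance` are illustrative.

```python
# Distribution of X (number of stores entered), from the table in Q.4
x_dist = {0: 0.05, 1: 0.35, 2: 0.25, 3: 0.20, 4: 0.15}

def mean(dist):
    return sum(value * prob for value, prob in dist.items())

def variance(dist):
    m = mean(dist)
    return sum((value - m) ** 2 * prob for value, prob in dist.items())

# Each value of X maps to 2X + 1 with the same probability
y_dist = {2 * x + 1: prob for x, prob in x_dist.items()}

print(mean(x_dist), variance(x_dist))   # direct calculation for X
print(mean(y_dist), variance(y_dist))   # direct calculation for Y
print(2 * mean(x_dist) + 1, 4 * variance(x_dist))   # via the laws
```

The direct calculation for Y and the calculation through the laws give the same mean and variance; the added constant 1 shifts the mean but leaves the variance unchanged.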

Q.5 The joint probability distribution of variables X and Y is shown in the table below.
Aviral and Shivani have joined an automobile factory and are in their training period.
Let X denote the number of cars that Aviral will pitch in a month, and let Y denote the
number of cars Shivani will pitch in a month.

Y \ X | 1 | 2 | 3
1 | 0.30 | 0.18 | 0.12
2 | 0.15 | 0.09 | 0.06
3 | 0.05 | 0.03 | 0.02

a) Determine the marginal probability distribution of X.

b) Determine the marginal probability distribution of Y.

c) Calculate E(X) and E(Y).

d) Calculate V(X) and V(Y).

e) Develop the probability distribution of X + Y.

f) Calculate E(X + Y) directly by using the probability distribution of X + Y.

g) Calculate V(X + Y) directly by using the probability distribution of X + Y.

h) Verify that E(X + Y) = E(X) + E(Y).

i) Verify that V(X + Y) = V(X) + V(Y). Did you expect this result? Why?

ANS:
X | 1 2 3

P(x) | 0.50 0.30 0.20

Y | 1 2 3

P(y) | 0.60 0.30 0.10

E(X) = 1.70 and E(Y) = 1.50

V(X) = 0.61 and V(Y) = 0.45

x+y | 2 3 4 5 6
P(x+y) | 0.30 0.33 0.26 0.09 0.02

E(X + Y) = 3.20
V(X + Y) = 1.06
E(X) + E(Y) =1.70+1.50=3.2=E(X+Y)

V(X) + V(Y) = 0.61 + 0.45 = 1.06 = V(X + Y). Yes, since X and Y are independent random
variables.
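
The marginal distributions, the distribution of X + Y and the independence claim can all be checked from the joint table. The following Python sketch is illustrative; the dictionary keys are (x, y) pairs read from the table in Q.5, and the variable names are assumptions.

```python
from collections import defaultdict

# Joint probabilities P(X = x, Y = y) from the table in Q.5, keyed as (x, y)
joint = {
    (1, 1): 0.30, (2, 1): 0.18, (3, 1): 0.12,
    (1, 2): 0.15, (2, 2): 0.09, (3, 2): 0.06,
    (1, 3): 0.05, (2, 3): 0.03, (3, 3): 0.02,
}

p_x, p_y, p_sum = defaultdict(float), defaultdict(float), defaultdict(float)
for (x, y), prob in joint.items():
    p_x[x] += prob          # marginal distribution of X
    p_y[y] += prob          # marginal distribution of Y
    p_sum[x + y] += prob    # distribution of X + Y

def mean(dist):
    return sum(v * p for v, p in dist.items())

def variance(dist):
    m = mean(dist)
    return sum((v - m) ** 2 * p for v, p in dist.items())

# X and Y are independent if every joint probability equals the product of marginals
independent = all(abs(prob - p_x[x] * p_y[y]) < 1e-9
                  for (x, y), prob in joint.items())

print(mean(p_sum), variance(p_sum), independent)
```

Every cell of the joint table equals the product of its marginals, which is why the variances add in this case; for dependent variables a covariance term would also be needed.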

Q.6 Two balanced dice are rolled simultaneously. Let X be the sum of the two dice.

a) Construct the probability distribution.

b) Find the mean and standard deviation of X.

Q.7 A company plans to start the publication of a political magazine in Bihar. A marketing
research company shared a survey of the demand for the magazine. Compute the
expected value and standard deviation of the demand for the magazine. Is it advisable
to start the business?

X p(x)
50,000 0.1
60,000 0.25
70,000 0.4
80,000 0.2
90,000 0.05
Q.8 A survey was conducted to find out how many credit cards a person carries.

No. of credit cards Probability

0 0.05
1 0.35
2 0.25
3 0.2
>=4 0.15

a) Is this a valid probability distribution?

b) What is the probability of a customer carrying at most 1 credit card?

c) What is the probability that a customer is carrying at least 3 credit cards?

d) Find the expected value and variance.

Q.9 If a football game ends in a tie, each team is given 5 penalty shots to decide the
winner of the match. Based on past records, the opposing coach believes that
the chance of converting all 5 shots is 0.35, the chance of converting 4 shots is
0.30, the chance of converting 3 shots is 0.20, the chance of converting 2 shots is
0.15, and the chance of converting 1 or fewer shots is 0.

a) Construct the valid probability distribution.

b) What is the probability of missing at least 2 shots?

c) What is the probability of missing all penalty shots?

d) What is the probability of hitting all penalty shots?
Q.10 A friendly cricket match is organized at the Ferozshah Kotla Stadium to help
flood-affected areas. The aim is to sell 10,000 tickets @ Rs 1,000. To attract the audience, the
organizers have announced a lucky draw. The lucky draw winner would get a cash prize
worth Rs 20,000. If a person purchases two tickets, what are his chances of winning
the draw?
Q.11 For patients over the age of 60 who are infected by the Corona Virus, the
probability of survival is 60 percent.

a) If 20 patients are admitted to the hospital, what is the probability of no deaths?

b) What is the probability of 5 or fewer deaths?

c) What is the probability of 8 or more recoveries?

d) What is the expected number of survivals?

e) Find the standard deviation.

f) What is the shape of the distribution?

Q.12 In a Burger King outlet, half of the customers order vegetarian burgers.
a) What is the probability that none of the next 5 customers will order a vegetarian
burger?

b) What is the probability that at least two customers order a vegetarian burger?

c) What is the probability that at most two customers order a vegetarian burger?
d) Construct the probability distribution and describe its shape.
Q.13 A production manager at Maruti Suzuki wants to find out whether there are manufacturing
defects in the engine. According to the records, there is only a 1 percent chance of a defect in an
engine. In a sample of 35 cars, what is the probability that:

a) More than 2 cars have a defective engine.

b) None of the cars has a defective engine.

c) At most 1 car has a defective engine.

d) What is the expected number of defective engines?

e) Find its standard deviation.

f) What is the shape of the distribution?


Q.14 A survey was conducted among working women to find out how they maintain a work-life
balance. The interesting finding was that they either go on short vacations
or plan weekend trips with their family. The probability of 15 days of vacation time was
0.25, the probability of 7 days of vacation time was 0.45, and the probability of a weekend
outing was 0.3. Suppose 15 women executives were interviewed.

a) What is the probability that 10 women executives go on 15 days of vacation?

b) What is the probability that at least 5 women executives go on weekend trips?

Q.15 According to the National Cancer Registry Programme of the Indian Council of Medical
Research (ICMR), more than 1300 Indians die every day due to cancer.
a) What is the probability that at least 1100 people die due to cancer every day?
b) What is the probability that at most 800 people die due to cancer every day?

THE NORMAL DISTRIBUTION
AND OTHER CONTINUOUS DISTRIBUTIONS

STRUCTURE
6.0 Objectives
6.1 Introduction
6.2 Continuous Distribution
6.3 Normal Distribution
6.4 Evaluating Normality
6.5 The Uniform Distribution
6.6 The Exponential Distribution
6.7 The Normal Approximation to the Binomial Distribution
6.8 Let Us Sum Up
6.9 Self-Assessment Questions
6.10 Answers to Self-Assessment Questions

6.0 OBJECTIVES

After reading this unit, you will be able to:


• understand the concepts of continuous distributions
• recognize the situations where uniform distributions occur
• identify the importance of the normal distribution
• use the normal distribution to solve concrete business problems
• identify the situations best fitted by an exponential distribution
• demonstrate the normal distribution approximation to the binomial distribution

6.1 INTRODUCTION
This chapter introduces three continuous distributions: the uniform distribution,
the normal distribution, and the exponential distribution. The text has been prepared so
that you can use these distributions to solve practical problems related
to continuous curves. Examples and practice problems are provided to
supplement your knowledge. The chapter also includes some exercises to make it easier for
you to understand all aspects of continuous curves.

6.2 CONTINUOUS DISTRIBUTIONS


INTRODUCTION TO CONTINUOUS DISTRIBUTIONS

A random variable (x) is said to have a continuous probability distribution if x can take any
value within a specified interval and its probability distribution is defined over that interval.
The normal distribution is the best-known and most commonly used example.
It is useful to work with such theoretical probability distributions because their properties
and characteristics are well known.
PROBABILITY DISTRIBUTIONS OF CONTINUOUS VARIABLES

A continuous random variable can take every possible value in a given interval, and its
probability distribution is defined over that interval. Because there are infinitely many points
between any two values of a continuous variable, it is not possible to list a probability for
each point. Instead, calculus is used to find probabilities as areas under the curve over an interval.

6.3 NORMAL DISTRIBUTION

The normal distribution is a continuous distribution and the most commonly used
distribution in statistics. Many real-life variables, such as weight, height, length, and speed,
approximately follow a normal distribution. Its characteristics are also observed in
measurements of living things, such as trees, animals, and insects.

Different combinations of the parameters mean μ and variance σ² produce many different normal
distributions, all with the same basic shape. Therefore, in most business situations, problems
are solved with the standard normal distribution rather than with each individual normal distribution.
A standard normal distribution has a mean of 0 and a standard deviation of 1. A normally
distributed value is converted to the standard normal scale by subtracting the mean from the
original value and dividing by the standard deviation.
Irrespective of its mean and variance, any normal distribution has the following characteristics:

a. The distribution is symmetric

b. The distribution is uni-modal
c. The distribution has a continuous range from −∞ to +∞

d. The total area under the curve is unity

e. All averages coincide (mean, median, and mode are equal)

As mentioned earlier, there can be many normal distributions with the characteristics
given above. Therefore, the standard normal variate is applied in most business situations
instead of the individual normal distributions.
The methodology of this transformation is described below.

A second point which needs to be understood is that the probability that X is exactly equal
to some value is zero, because the area under the curve at a single point,
which has no width, is zero. We can calculate a non-zero probability that a man weighs more
or less than a fixed amount, but the probability that his weight is exactly equal to that amount is
zero in a continuous distribution, because probability is measured as area over an interval.
The normal distribution is characterized by two parameters: the mean, μ, and
the standard deviation, σ. Each pair of values produces a particular normal distribution. The density
function of the normal distribution is given as:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$

where
μ = mean of x
σ = standard deviation of x
π = 3.14159..., and e = 2.71828...
We can calculate this density function for different combinations of μ and σ, and each
combination gives a different normal distribution. Thus, to
obtain normal probabilities, we use the standardized normal variable Z, obtained by converting
a normally distributed variable. Using this transformation and a normal probability table, we
can avoid the tedious computations that the density function above would require.
The standardized normal variable Z is computed as:

$$Z = \frac{x - \mu}{\sigma}$$

The transformation is illustrated in Fig. 6.2, where μ = 8 and σ = 2.

Fig. 6.2: Transformation of Scales

Up to 3σ limit          | μ−3σ | μ−2σ | μ−1σ | μ | μ+1σ | μ+2σ | μ+3σ
X scale (μ = 8, σ = 2)  | 2    | 4    | 6    | 8 | 10   | 12   | 14
Z scale (μ = 0, σ = 1)  | −3   | −2   | −1   | 0 | +1   | +2   | +3

Example:
Suppose that, in the final exam of a statistics class, the students' scores are normally
distributed with a mean of 70 and a standard deviation of 20. Calculate the probability that
a student selected randomly from the class has scored more than 80 marks.

P(X > 80) = ?

Solution
First, convert the variable X into the standard variable Z. The z score for this
problem is calculated as follows:

$$Z = \frac{x - \mu}{\sigma} = \frac{80 - 70}{20} = \frac{10}{20} = 0.5$$

The z value denotes the number of standard deviations by which the original value (80)
lies above the mean. From the Z-table below (intersection of the 0.5 row and the 0.00 column),
the corresponding area is 0.1915, which is the area
between the mean and the point in question, i.e. X = 80. This is the probability of a score
between the mean (70) and 80. To obtain the probability of a score greater than 80, we
subtract this area from 0.5000 (one half of the curve has an area of 0.5). Therefore, the
probability of X > 80 is 0.5000 − 0.1915 = 0.3085. If we need the probability of a score less
than 80, the above value is subtracted from 1, which comes to 1 − 0.3085 = 0.6915.
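
The table lookup in this example can be reproduced in Python. The sketch below assumes the standard-library class `statistics.NormalDist` (Python 3.8 or later); the variable names are ours and are not part of the text.

```python
from statistics import NormalDist

mean, sd, x = 70, 20, 80

# Standardize: number of standard deviations between x and the mean.
z = (x - mean) / sd

std_normal = NormalDist()  # mean 0, standard deviation 1

# A one-tailed z-table lists the area between the mean and z.
area_mean_to_z = std_normal.cdf(z) - 0.5

p_greater = 0.5 - area_mean_to_z   # P(X > 80)
p_less = 1 - p_greater             # P(X < 80)

print(z, round(area_mean_to_z, 4), round(p_greater, 4), round(p_less, 4))
```

Subtracting 0.5 from the cumulative probability is what turns a left-tail value into the "area between the mean and z" that the one-tailed table reports.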


Areas Under the One-Tailed Standard Normal Curve

[Figure: The “Z” Distribution Curve]

Z 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 | 0.08 | 0.09
0.0 0.0000 | 0.0040 | 0.0080 | 0.0120 | 0.0160 | 0.0199 | 0.0239 | 0.0279 | 0.0319 | 0.0359
0.1 0.0398 | 0.0438 | 0.0478 | 0.0517 | 0.0557 | 0.0596 | 0.0636 | 0.0675 | 0.0714 | 0.0753
0.2 0.0793 | 0.0832 | 0.0871 | 0.0910 | 0.0948 | 0.0987 | 0.1026 | 0.1064 | 0.1103 | 0.1141
0.3 0.1179 | 0.1217 | 0.1255 | 0.1293 | 0.1331 | 0.1368 | 0.1406 | 0.1443 | 0.1480 | 0.1517
0.4 0.1554 | 0.1591 | 0.1628 | 0.1664 | 0.1700 | 0.1736 | 0.1772 | 0.1808 | 0.1844 | 0.1879
0.5 0.1915 | 0.1950 | 0.1985 | 0.2019 | 0.2054 | 0.2088 | 0.2123 | 0.2157 | 0.2190 | 0.2224
0.6 0.2257 | 0.2291 | 0.2324 | 0.2357 | 0.2389 | 0.2422 | 0.2454 | 0.2486 | 0.2517 | 0.2549
0.7 0.2580 | 0.2611 | 0.2642 | 0.2673 | 0.2704 | 0.2734 | 0.2764 | 0.2794 | 0.2823 | 0.2852
0.8 0.2881 | 0.2910 | 0.2939 | 0.2967 | 0.2995 | 0.3023 | 0.3051 | 0.3078 | 0.3106 | 0.3133
0.9 0.3159 | 0.3186 | 0.3212 | 0.3238 | 0.3264 | 0.3289 | 0.3315 | 0.3340 | 0.3365 | 0.3389
1.0 0.3413 | 0.3438 | 0.3461 | 0.3485 | 0.3508 | 0.3531 | 0.3554 | 0.3577 | 0.3599 | 0.3621
1.1 0.3643 | 0.3665 | 0.3686 | 0.3708 | 0.3729 | 0.3749 | 0.3770 | 0.3790 | 0.3810 | 0.3830
1.2 0.3849 | 0.3869 | 0.3888 | 0.3907 | 0.3925 | 0.3944 | 0.3962 | 0.3980 | 0.3997 | 0.4015
1.3 0.4032 | 0.4049 | 0.4066 | 0.4082 | 0.4099 | 0.4115 | 0.4131 | 0.4147 | 0.4162 | 0.4177
1.4 0.4192 | 0.4207 | 0.4222 | 0.4236 | 0.4251 | 0.4265 | 0.4279 | 0.4292 | 0.4306 | 0.4319
1.5 0.4332 | 0.4345 | 0.4357 | 0.4370 | 0.4382 | 0.4394 | 0.4406 | 0.4418 | 0.4429 | 0.4441
1.6 0.4452 | 0.4463 | 0.4474 | 0.4484 | 0.4495 | 0.4505 | 0.4515 | 0.4525 | 0.4535 | 0.4545
1.7 0.4554 | 0.4564 | 0.4573 | 0.4582 | 0.4591 | 0.4599 | 0.4608 | 0.4616 | 0.4625 | 0.4633
1.8 0.4641 | 0.4649 | 0.4656 | 0.4664 | 0.4671 | 0.4678 | 0.4686 | 0.4693 | 0.4699 | 0.4706
1.9 0.4713 | 0.4719 | 0.4726 | 0.4732 | 0.4738 | 0.4744 | 0.4750 | 0.4756 | 0.4761 | 0.4767
2.0 0.4772 | 0.4778 | 0.4783 | 0.4788 | 0.4793 | 0.4798 | 0.4803 | 0.4808 | 0.4812 | 0.4817
2.1 0.4821 | 0.4826 | 0.4830 | 0.4834 | 0.4838 | 0.4842 | 0.4846 | 0.4850 | 0.4854 | 0.4857
2.2 0.4861 | 0.4864 | 0.4868 | 0.4871 | 0.4875 | 0.4878 | 0.4881 | 0.4884 | 0.4887 | 0.4890
2.3 0.4893 | 0.4896 | 0.4898 | 0.4901 | 0.4904 | 0.4906 | 0.4909 | 0.4911 | 0.4913 | 0.4916
2.4 0.4918 | 0.4920 | 0.4922 | 0.4925 | 0.4927 | 0.4929 | 0.4931 | 0.4932 | 0.4934 | 0.4936
2.5 0.4938 | 0.4940 | 0.4941 | 0.4943 | 0.4945 | 0.4946 | 0.4948 | 0.4949 | 0.4951 | 0.4952
2.6 0.4953 | 0.4955 | 0.4956 | 0.4957 | 0.4959 | 0.4960 | 0.4961 | 0.4962 | 0.4963 | 0.4964
2.7 0.4965 | 0.4966 | 0.4967 | 0.4968 | 0.4969 | 0.4970 | 0.4971 | 0.4972 | 0.4973 | 0.4974
2.8 0.4974 | 0.4975 | 0.4976 | 0.4977 | 0.4977 | 0.4978 | 0.4979 | 0.4979 | 0.4980 | 0.4981
2.9 0.4981 | 0.4982 | 0.4982 | 0.4983 | 0.4984 | 0.4984 | 0.4985 | 0.4985 | 0.4986 | 0.4986
3.0 0.4987 | 0.4987 | 0.4987 | 0.4988 | 0.4988 | 0.4989 | 0.4989 | 0.4989 | 0.4990 | 0.4990
3.1 0.4990 | 0.4991 | 0.4991 | 0.4991 | 0.4992 | 0.4992 | 0.4992 | 0.4992 | 0.4993 | 0.4993
3.2 0.4993 | 0.4993 | 0.4994 | 0.4994 | 0.4994 | 0.4994 | 0.4994 | 0.4995 | 0.4995 | 0.4995
3.3 0.4995 | 0.4995 | 0.4995 | 0.4996 | 0.4996 | 0.4996 | 0.4996 | 0.4996 | 0.4996 | 0.4997
3.4 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4997 | 0.4998
3.5 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998 | 0.4998
3.6 0.4998 | 0.4998 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999
3.7 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999
3.8 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999 | 0.4999
3.9 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000 | 0.5000

For normally distributed data, we can use Excel to calculate the mean and the standard
deviation. Once the mean and the standard deviation are available, we can use the function
NORM.DIST to calculate cumulative probabilities. The function is used in
the following format: NORM.DIST(x value, mean, standard deviation, TRUE).

Mean = 70
S.D. = 20

P(X > 80) = 1 − P(X < 80) = 1 − NORM.DIST(80, 70, 20, TRUE) = 0.308538

[Excel Function Arguments dialog for NORM.DIST with Standard_dev = 20 and Cumulative = TRUE:
formula result = 0.691462461. The function returns the normal distribution for the specified
mean and standard deviation. Cumulative is a logical value: for the cumulative distribution
function, use TRUE; for the probability density function, use FALSE.]

6.4 EVALUATING NORMALITY

All normal distributions have a similar shape. To judge whether data are distributed
normally, compare them with the empirical rule, whose rules of thumb should be borne in
mind:
• Around 68% of the data lie within μ ± 1σ
• Around 95% of the data lie within μ ± 2σ
• Around 99.7% of the data lie within μ ± 3σ
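
The percentages of the empirical rule can be recovered from the standard normal curve itself. The following is a small sketch, again assuming the standard-library `statistics.NormalDist`; the dictionary name `coverage` is ours.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in coverage.items():
    print(f"within {k} sigma: {area:.4f}")
```

The three areas come out at roughly 68%, 95%, and 99.7%, which is why data that depart sharply from these shares are unlikely to be normal.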

6.5 THE UNIFORM DISTRIBUTION

In continuous distributions, probabilities cannot be calculated for a single
point. If the probabilities are equally distributed over a given interval from a to b, so that
the density is a flat line of height 1/(b − a), the probability that x falls between x1 and x2 is:

$$P(x_1 \le x \le x_2) = \frac{x_2 - x_1}{b - a}$$

where:
a ≤ x1 ≤ x2 ≤ b

The mean and standard deviation of the distribution are as follows:

$$\text{Mean} = \frac{a + b}{2}$$

$$\text{Standard Deviation} = \frac{b - a}{\sqrt{12}}$$

Example:
Suppose it takes between 20 and 40 minutes to complete a particular process and the process time
is uniformly distributed. Calculate the probability that the process takes 25 to 30 minutes.

$$P(25 \le x \le 30) = \frac{30 - 25}{40 - 20} = \frac{5}{20} = 0.25$$

Hence, the chance that the process will be completed in 25 to 30 minutes is 25%.
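
The same calculation can be written as a short Python function. This is a sketch; the function name `uniform_prob` is ours and is only illustrative.

```python
from math import sqrt

def uniform_prob(x1, x2, a, b):
    """P(x1 <= x <= x2) for x uniformly distributed on [a, b]."""
    if not (a <= x1 <= x2 <= b):
        raise ValueError("need a <= x1 <= x2 <= b")
    return (x2 - x1) / (b - a)

a, b = 20, 40                      # process time limits in minutes
p = uniform_prob(25, 30, a, b)     # probability of 25 to 30 minutes
mean = (a + b) / 2                 # mean of the distribution
sd = (b - a) / sqrt(12)            # standard deviation of the distribution

print(p, mean, round(sd, 4))
```

For this process the mean time is 30 minutes and the standard deviation is about 5.77 minutes.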

6.6 THE EXPONENTIAL DISTRIBUTION

The exponential distribution is one of the important continuous probability distributions. It is
used to describe the distribution of the times between random occurrences.
Some of the characteristics of this distribution are:
• The exponential distribution is continuous and is a family of distributions.
• The exponential distribution is generally skewed to the right.
• In this distribution, values of x can vary from zero to infinity.
• x = 0 is the apex of the distribution.
• As the value of x increases, the curve decreases gradually.

The density function of an exponential probability distribution is:

$$f(x) = \lambda e^{-\lambda x}$$

where
x ≥ 0, λ > 0, and e = 2.71828...
The distribution has only one parameter, λ (which is the inverse of the mean). Each
value of λ gives a different exponential distribution, leading to a family of exponential
distributions.

Suppose arrivals at a ticket counter are Poisson distributed with an average of 3 customers every minute.
What is the probability that an interval of 2 minutes or more elapses between arrivals?

Solution:
λ is always the inverse of the mean time between arrivals. Here λ = 3 customers per minute,
so the mean time between arrivals is 1/λ = 1/3 minute.

The probability that the time between arrivals is x or more is the area under the density
curve to the right of x:

$$P(X \ge x) = e^{-\lambda x}$$

For x = 2 minutes:

$$P(X \ge 2) = e^{-(3)(2)} = e^{-6} \approx 0.0025$$

Equivalently, P(X < 2) = 1 − e^{−6} ≈ 0.9975. There is therefore only about a 0.25% chance
that an interval of 2 minutes or more elapses between arrivals.
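
With an arrival rate of 3 customers per minute, the exponential tail probability can be checked numerically. The sketch below uses only `math.exp`; the variable names are ours.

```python
from math import exp

rate = 3   # lambda: average number of arrivals per minute
t = 2      # length of the interval in minutes

# Cumulative probability: the next arrival occurs within t minutes.
p_within = 1 - exp(-rate * t)

# Tail probability: t minutes or more elapse between arrivals.
p_at_least = exp(-rate * t)

print(round(p_within, 6), round(p_at_least, 6))
```

The two values add to 1. The cumulative value is the one a cumulative exponential function reports, so the tail is obtained by subtracting it from 1.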

For Poisson-distributed arrivals, we can use Excel to calculate the probability of the time
elapsed between two arrivals. The function is used in the following format:

EXPON.DIST(x value, arrival rate, TRUE)

With TRUE as the last argument, the function returns the cumulative probability P(X ≤ x).


Fig. 6.5: Use of the EXPON.DIST Function to Calculate the Probability

x = 2
Lambda = 3
=EXPON.DIST(2,3,TRUE) = 0.997521 (probability that the interval between arrivals is 2 minutes or less)
Probability of an interval of 2 minutes or more between arrivals = 1 − 0.997521 = 0.002479

[Excel Function Arguments dialog for EXPON.DIST with Cumulative = TRUE: formula result = 0.997521248.
The function returns the exponential distribution. Cumulative is a logical value for the function
to return: the cumulative distribution function = TRUE; the probability density function = FALSE.]

6.7. THE NORMAL APPROXIMATION TO THE BINOMIAL DISTRIBUTION


A binomial distribution is the probability distribution of the number of
successful outcomes (where each outcome is either a success or a failure) in a given number of
trials, each of which has the same probability of success (p). The problem is to
determine the probability of obtaining a desired number of successes in several trials.
For example, if we toss a coin once, the probability of obtaining a head or a tail is 0.5. However, if we
toss it 4 times, we may need to calculate the probability of obtaining exactly 2 heads.

$$P(x; n, p) = {}^{n}C_{x}\, p^{x} (1-p)^{n-x}$$

where,

n = the number of experiments (trials)

x = the number of successes, x = 0, 1, 2, ..., n
p = probability of success in a single experiment
q = probability of failure in a single experiment = 1 − p

The binomial distribution formula can also be written out for n Bernoulli trials, where

$${}^{n}C_{x} = \frac{n!}{x!\,(n-x)!}$$

Hence,

$$P(x; n, p) = \frac{n!}{x!\,(n-x)!}\, p^{x} q^{n-x}$$
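
The coin example can be worked through with the binomial formula in Python. This is a sketch using the standard-library `math.comb` (Python 3.8 or later); the function name `binom_pmf` is ours.

```python
from math import comb

def binom_pmf(x, n, p):
    """Probability of exactly x successes in n trials with success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Probability of exactly 2 heads in 4 tosses of a fair coin.
p_two_heads = binom_pmf(2, 4, 0.5)

# The probabilities over x = 0, 1, ..., n must add up to 1.
total = sum(binom_pmf(x, 4, 0.5) for x in range(5))

print(p_two_heads, total)
```

There are 6 ways to place 2 heads among 4 tosses, each with probability 1/16, which gives 0.375.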

Binomial distribution problems can be approximated by the normal distribution.

As the sample size increases, for any value of p, the shape of the binomial distribution
approaches that of a normal distribution. When p is close to 0.5, the approach
to the normal distribution is faster, even for small values of n. If the sample size and the value of p are
both low, the distribution is skewed to the right. For a sufficiently large value of n (sample
size), the normal distribution is a good approximation for a binomial distribution. A rule of thumb
states that if n·p and n·(1 − p) are both greater than or equal to 10, the normal
distribution is appropriate.
Approximating the binomial by the normal curve requires converting the two
parameters of a binomial distribution, n and p, to the parameters of the normal distribution,
μ and σ, as given below:

$$\mu = n p \quad \text{and} \quad \sigma = \sqrt{n p q}, \quad \text{where } q = 1 - p$$

Once the new parameters are obtained, a test determines whether the normal
distribution is an acceptable approximation of the binomial distribution. By the empirical rule,
about 99.7% of the values of a normal distribution lie within μ ± 3σ. If this interval lies entirely
between 0 and n, the range of possible x values, the normal approximation of the binomial
distribution is acceptable.

Suppose:

P(x ≥ 20 | n = 50 and p = 0.4) = ?

It can be translated to a normal problem by calculating the new parameters as:

μ = n·p = (50)(0.4) = 20 and σ = √(n·p·q) = √((50)(0.4)(0.6)) = 3.46, where q = 1 − p
Now, the binomial problem can be converted to a normal problem as:
P(x ≥ 20 | μ = 20 and σ = 3.46)
The normal curve fit can be tested using μ ± 3σ = 20 ± 3(3.46) = 20 ± 10.39. The resulting range,
9.61 to 30.39, lies between 0 and n = 50, so the normal curve can be used as an
approximation.
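
The quality of this approximation can be examined by comparing it with the exact binomial probability. The sketch below is illustrative only. It also applies a continuity correction (using 19.5 instead of 20), which is a common refinement that the text above does not discuss; treat it as our assumption.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.4
mu = n * p
sigma = sqrt(n * p * (1 - p))

# Acceptability test from the text: mu +/- 3 sigma must lie between 0 and n.
low, high = mu - 3 * sigma, mu + 3 * sigma
approximation_ok = 0 <= low and high <= n

# Exact binomial probability P(x >= 20).
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(20, n + 1))

normal = NormalDist(mu, sigma)
approx_plain = 1 - normal.cdf(20)    # direct use of the normal curve
approx_cc = 1 - normal.cdf(19.5)     # with continuity correction (our assumption)

print(round(low, 2), round(high, 2), approximation_ok)
print(exact, approx_plain, approx_cc)
```

Because 20 is the mean of the approximating normal curve, the plain approximation gives exactly one half. The corrected value accounts for the whole bar at x = 20 and lies closer to the exact figure.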

6.8 LET US SUM UP
This chapter covered important formulas, concepts, distributions, and the solution of
problems, including the use of Excel worksheets. Starting with the continuous random variable and its
distribution, the chapter described the normal, uniform, and exponential distributions and the normal
approximation to the binomial distribution, along with suitable examples. We
discussed the application of standard normal random variables and how to determine the
mean and standard deviation of a normal random variable. We also looked at the real-world
applications of the normal, uniform, and exponential distributions, and at the calculation
and use of the normal approximation to the binomial distribution.

6.9 SELF-ASSESSMENT QUESTIONS

Q.1 For a binomial distribution with n fixed, if p > 0.5, then:

(a) It is a symmetric binomial distribution.
(b) The exponential distribution is the worst approximation.

(c) It is a left-skewed binomial distribution.

(d) It is a right-skewed binomial distribution.

Q.2 Which of the following statements is true for the probability distribution of a random variable?
(a) Probability is distributed over a range of possible values.
(b) The sum of the probabilities of all outcomes is 1.
(c) In any case, no probability value occurs more than once.

(d) (a) and (b)
Q.3 Select the statement that matches the given situation. Suppose p = 0.3. The
binomial expression 6!/(3! 3!) (0.3)³ (0.7)³ means:
(a) Probability of getting exactly three successes in six trials.
(b) Probability of getting exactly four successes in six trials.
(c) Probability of getting three or more successes in seven trials.
(d) Probability of getting four or more successes in seven trials.
Q.4 If a random variable x follows a uniform distribution over the interval 9 to 15, then
the height of the distribution is
(a) 1/9
(b) 1/6
(c) 1/15

(d) 1/24
Q.5 What is the mean of the distribution of x when it is uniformly distributed over
the interval 9 to 15?
(a) 12
(b) 15
(c) 9
(d) 6
6.10 ANSWERS TO SELF-ASSESSMENT QUESTIONS

Q.1 c
Q.2 d
Q.3 a
Q.4 b
Q.5 a

SAMPLING DISTRIBUTIONS

STRUCTURE
7.0 Objectives
7.1 Introduction
7.2 Sampling Distribution
7.3 Sampling Distribution of the Mean
7.4 Sampling Distribution of the Proportion
7.5 Determining Sample Size
7.6 Let Us Sum Up
7.7 Self-Assessment Questions
7.8 Answers to Self-Assessment Questions

7.0 OBJECTIVES

After reading this unit, you will be able to:


• understand the concepts of a sampling distribution
• recognize the distribution of a sample’s mean using the central limit theorem
• calculate the z-value for the distribution of a sample’s mean and sample proportions
• estimate a suitable sample size in business situations
• understand the concept of confidence intervals

7.1 INTRODUCTION

This chapter introduces the concepts of the sampling distribution. The ultimate
objective is to enable you to understand the central limit theorem and the distribution of a
sample’s mean and of sample proportions. Examples and practice problems are provided at
the end for solving. The chapter includes some exercises, which will allow you to calculate
the distribution of a sample’s mean, sample proportions, and the appropriate sample size in
given situations.


7.2 SAMPLING DISTRIBUTION

The probability distribution of a statistic is known as a sampling distribution.

If we draw all possible samples of size n from a given population and we compute a statistic
(e.g., a mean, proportion, standard deviation) for each sample, then the probability
distribution of this statistic is called a sampling distribution.

If we select a single random sample of a predetermined size from the population and use the
sample statistic to estimate the population parameter, the value obtained depends on which
sample happens to be drawn, and each possible value occurs with a certain probability. This distribution of
probabilities among all possible values the statistic may take when computed from random
samples of the same size, drawn from a specified population, is called the sampling
distribution.

7.3 SAMPLING DISTRIBUTION OF THE MEAN (X̄)

The sample mean is a random variable. Its value depends on the
elements that happen to be drawn in the random sample and on the distribution of the
population from which the sample is drawn. A sample mean therefore has a probability distribution.

Example:
The sample mean is one of the more common statistics used in the inferential process. Let us
try to derive the sampling distribution in the simple instance of drawing a sample of size n = 2
items from a population uniformly distributed over the integers 1 to 6.
Using an Excel-produced histogram, we can see the shape of the distribution of this
population of data.

Fig. 7.1: Shape of the distribution of this population of data

[Histogram: Frequency (vertical axis) against Value of X (horizontal axis) for X = 1 to 6]


Table 7.1: Possible Values of Two Sample Points from a Uniform Population
of the Integers 1 to 6

1,1    2,1    3,1    4,1    5,1    6,1
1,2    2,2    3,2    4,2    5,2    6,2
1,3    2,3    3,3    4,3    5,3    6,3
1,4    2,4    3,4    4,4    5,4    6,4
1,5    2,5    3,5    4,5    5,5    6,5
1,6    2,6    3,6    4,6    5,6    6,6

Table 7.2: Mean of Possible Values of Two Sample Points from a
Uniform Population of the Integers 1 to 6

1      1.5    2      2.5    3      3.5
1.5    2      2.5    3      3.5    4
2      2.5    3      3.5    4      4.5
2.5    3      3.5    4      4.5    5
3      3.5    4      4.5    5      5.5
3.5    4      4.5    5      5.5    6

Fig. 7.2: Shape of the Distribution of sample means

[Histogram: Frequency (vertical axis) against Sample Means (horizontal axis)]

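We can see that the shape of the population distribution is quite different from the shape of the
sampling distribution of the sample means: the population is flat, whereas the sample means
pile up around the centre (3.5).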
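
The 36 possible samples of size 2 and the frequencies of their means can be enumerated in Python. This is a sketch using only the standard library; the variable names are ours.

```python
from collections import Counter
from itertools import product
from statistics import mean, pvariance

population = [1, 2, 3, 4, 5, 6]

# All 36 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=2))
sample_means = [(a + b) / 2 for a, b in samples]

# Frequency of each possible sample mean.
counts = Counter(sample_means)
for m in sorted(counts):
    print(m, counts[m])

print(mean(sample_means), pvariance(sample_means), pvariance(population))
```

The frequencies rise from 1 at a mean of 1 to 6 at a mean of 3.5 and fall back to 1 at a mean of 6. The mean of the sample means equals the population mean, and their variance is the population variance divided by the sample size of 2.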


Example:
Let us reconsider the example discussed above under the normal distribution. In the final exam of a
statistics class, the students have a mean score of 70 and a standard deviation of 20. If we
randomly select a group of 16 students from the class, what is the probability that the
average score of the selected students is greater than 80?

P(X̄ > 80) = ?

Solution

This problem calls for determining the area of the upper tail of the sampling distribution of the
mean. The z-score for this problem is

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{80 - 70}{20/\sqrt{16}} = \frac{10}{5} = 2$$

Here the standard deviation used is different: the standard
deviation of the sample mean is the standard deviation of the population divided by the square
root of the sample size. This is why the z value differs from the one obtained for a single student.

Using the z table, we arrive at a probability of 0.4772 (intersection of the 2.0 row and the 0.00 column).
This is the probability of obtaining a sample mean between 70 and 80. To derive the probability of a
sample mean greater than 80, we subtract 0.4772 from 0.5000 (as before), which gives 0.0228.
Hence, there is around a 2% probability that the average score
of the 16 selected students is greater than 80.
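
The calculation for the group of 16 students can be reproduced as follows. The sketch assumes the standard-library `statistics.NormalDist`; the variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n, x_bar = 70, 20, 16, 80

# Standard deviation of the sample mean (standard error).
std_error = sigma / sqrt(n)

# z-score of the sample mean.
z = (x_bar - mu) / std_error

# Upper-tail probability that the sample mean exceeds 80.
p_greater = 1 - NormalDist().cdf(z)

print(std_error, z, round(p_greater, 4))
```

Dividing by the square root of 16 shrinks the standard deviation from 20 to 5, which is why a mean of 80 is far less likely for a group than a score of 80 is for a single student.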

The Central Limit Theorem

As the sample size increases, the sampling distribution of the mean tends to the normal distribution. This is
known as the central limit theorem. It holds whatever the shape of the population
distribution, and it is one of the most important results in statistics.
If samples of size n are drawn randomly from a population that has a mean μ and a standard
deviation σ, the sample means X̄ are approximately normally distributed for sufficiently large
sample sizes (n > 30). If the population is normally distributed, the sample means are
normally distributed for any sample size.

Mathematically, this can be written as:

$$\mu_{\bar{x}} = \mu \ \text{(mean of the sample means)} \quad \text{and} \quad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \ \text{(standard deviation of the sample means)}$$
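
A small simulation illustrates the two formulas. This is a sketch under our own assumptions: the population is the integers 1 to 6, the sample size is 30, 10,000 samples are drawn, and the random seed is fixed only so that the run is repeatable.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the simulation is repeatable

population = [1, 2, 3, 4, 5, 6]   # a flat, non-normal population
n = 30                            # size of each sample
num_samples = 10_000              # number of samples drawn

# Draw many samples with replacement and record each sample mean.
sample_means = [
    sum(random.choices(population, k=n)) / n for _ in range(num_samples)
]

observed_mean = mean(sample_means)
observed_sd = pstdev(sample_means)
expected_sd = pstdev(population) / n ** 0.5   # sigma / sqrt(n)

print(round(observed_mean, 3), round(observed_sd, 3), round(expected_sd, 3))
```

The simulated mean of the sample means should sit close to the population mean of 3.5, and their standard deviation close to σ/√n, even though the population itself is not normal.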

7.4 SAMPLING DISTRIBUTION OF THE PROPORTION

Suppose we are interested in knowing what proportion of people in a sample have a
preference for a car of a particular colour. In such cases the sample proportion is the
statistic of choice. The sampling distribution of the sample proportion is based on the
binomial distribution with parameters n and p, where n is the sample size and p is the
population proportion.

As the sample size n increases, the sampling distribution of the sample proportion p̂, by the Central
Limit Theorem, approaches a normal distribution with mean p and standard deviation

$$\sqrt{\frac{p(1-p)}{n}}$$

The sample proportion is computed by dividing the frequency with which a given
characteristic occurs in a sample by the number of items in the sample.

$$\hat{p} = \frac{x}{n}$$

where

x = number of items in a sample that have the characteristic

n = number of items in the sample

Example:
A recent survey in ABC Company suggests that 80% of employees consider salary as the most
important component of job satisfaction. Suppose a random sample of 100 employees is
selected. What is the probability that more than 90% of the sampled employees consider salary as the
most important component of job satisfaction?
Answer:

$$z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} = \frac{0.9 - 0.8}{\sqrt{0.8(0.2)/100}} = \frac{0.1}{0.04} = 2.5$$

Using the normal probability table (z table), the value at z = 2.5 is 0.4938, and the
area under the normal curve to the right of z = 2.5 is 0.0062 (0.5000 − 0.4938). Therefore, in the case
where the population proportion is 0.80, the probability is less than 1% that more than 90%
of the sampled employees consider salary as the most important component of job satisfaction.
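
The z value and the tail area for this proportion example can be verified in Python. The sketch assumes the standard-library `statistics.NormalDist`; the variable names are ours.

```python
from math import sqrt
from statistics import NormalDist

p = 0.8        # population proportion
n = 100        # sample size
p_hat = 0.9    # sample proportion of interest

# Standard deviation of the sample proportion.
std_error = sqrt(p * (1 - p) / n)

z = (p_hat - p) / std_error

# Upper-tail probability that the sample proportion exceeds 0.9.
p_greater = 1 - NormalDist().cdf(z)

print(round(std_error, 4), round(z, 4), round(p_greater, 4))
```

The standard error of 0.04 means that a sample proportion of 0.9 lies 2.5 standard errors above the population proportion.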

7.5 DETERMINING SAMPLE SIZE

When the population is very large, studying all of it is difficult, so we study a portion of the
population. It is unrealistic to study the entire population in most situations due to economic
constraints, time constraints, and other limitations. Two important examples are predicting
election results and market surveys conducted by companies. The issue here is: how large
should a sample be? To estimate the sample size in such situations, we must know the
answers to the following three questions:

1. How much error can we tolerate?

2. What is the desired confidence level?

3. What is the variance of the population under study?

Once we have answers to the above questions, the size of the sample can be determined by
using the z formula for sample means and solving for n. Consider

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

In case σ is unknown, it can be estimated as 1/4 of the range. Let E = (x̄ − μ) be the error of
estimation resulting from the sampling process. Substituting E into the preceding formula gives:

$$z = \frac{E}{\sigma/\sqrt{n}}$$

Solving for n results in a formula that can be expressed as:

$$n = \frac{z^{2}\sigma^{2}}{E^{2}} = \left(\frac{z\sigma}{E}\right)^{2}$$

Example
Suppose we want to estimate the average income of people in a particular region where
income ranges from $1,000 to $10,000. We want to be 95% confident and estimate is to be
based on recent data with the actual figure from the sample. If we can tolerate the error of
+5100, then what should be the sample size to estimate average income?

Solution

Here, E = $100, the required confidence level is 95% (z value 1.96), and σ is unknown, so
we estimate it as (1/4)(range). In this case the range is $10,000 − $1,000 = $9,000; therefore
the estimate for σ = (1/4)(9,000) = 2,250.

Thus,

$$n = \frac{z^2 \sigma^2}{E^2} = \frac{(1.96)^2 (2{,}250)^2}{(100)^2} = 1{,}944.81$$

Because a sample size must be a whole number, we round up: the calculation shows that we
need a sample size of around 1,945.
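The calculation above can be checked with a few lines of Python. This is only a minimal sketch: the function name `sample_size_for_mean` is an illustrative choice, not something defined in this unit, and σ is estimated as one quarter of the range exactly as in the example.

```python
import math

def sample_size_for_mean(z, sigma, error):
    """Return n = (z * sigma / error)^2, rounded up to a whole observation."""
    return math.ceil((z * sigma / error) ** 2)

# Values from the income example: range $1,000 to $10,000, E = $100, 95% confidence.
sigma_estimate = (10_000 - 1_000) / 4  # one quarter of the range = 2,250
n_required = sample_size_for_mean(z=1.96, sigma=sigma_estimate, error=100)
print(n_required)
```

Rounding up (rather than to the nearest integer) keeps the error within the tolerated ±$100.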

Constructing a confidence interval

A confidence interval expresses the amount of uncertainty associated with a sample
estimate. It is computed from the sample size, the proportion observed in the sample, and the
chosen confidence level. Sample estimates are at best approximations of the population
values and can never be exact. Therefore, a range with a stated degree of confidence is
computed around the figure obtained from the sample.
A confidence level of 95% implies a z value of 1.96.

The formula for determining the lower and upper limits of the interval is:

$$CI = p \pm z \times \sqrt{p(1 - p)/n}$$

where,

CI = confidence interval
p = sample proportion
z = value from the z-score table, e.g. 1.96 for 95% confidence
n = sample size

Thus, for a proportion of 0.5, a confidence level of 95%, and a sample size of 1,280, the result
is:

$$CI = 0.5 \pm 1.96 \times \sqrt{0.5 \times 0.5 / 1{,}280} \approx 0.5 \pm 0.027$$
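The interval for this example can be evaluated numerically. The sketch below is illustrative only; the helper name `proportion_confidence_interval` is an assumption introduced here, and it simply applies the formula p ± z × √(p(1 − p)/n).

```python
import math

def proportion_confidence_interval(p, n, z=1.96):
    """Return the (lower, upper) limits of p +/- z * sqrt(p(1-p)/n)."""
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Example from the text: p = 0.5, n = 1,280, 95% confidence (z = 1.96).
lower, upper = proportion_confidence_interval(0.5, 1280)
print(f"{lower:.4f} to {upper:.4f}")
```

For these inputs the limits come out at roughly 0.473 and 0.527, i.e. about 2.7 percentage points either side of the sample proportion.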

7.6 LET US SUM UP

This chapter covers the important formulas and concepts of sampling distributions, of
estimation from samples, and of determining sample sizes. Starting with the sampling
distribution and the central limit theorem, the chapter also describes the distribution of a
sample’s mean and of sample proportions, along with suitable examples. In this chapter, we
have also discussed how to determine sample sizes in a real business situation.

7.7. SELF-ASSESSMENT QUESTIONS


Q.1 Standard deviation sample statistics is known as:
a. Continuous probability distribution
b. Sampling distribution
c. Frequency distribution
d. Standard normal distribution

Q.2 The variance of the sample mean can be calculated as:
a. Population variance divided by sample size.
b. Population standard deviation divided by sample size.
c. Population variance divided by sample mean.
d. Population standard deviation divided by sample mean.
Q.3 Select a statement that is true for the distribution of the sample mean that follows a
normal distribution.
a. Irrespective of the distribution of the population of a selected random sample.
b. Only if the sample is selected from a normal distribution.
c. The distribution of the sample mean remains the same as the distribution of the
population of its origin.
d. Population distribution needs to be a continuous distribution.
Q.4 To get the precise result using the central limit theorem, the minimum sample size
should be
a. 20
b. 50
c. 30
d. 100
Q.5 Suppose a sample of size 36 is drawn from a population with a mean of 60 and a
standard deviation of 12. What is the expected value of the sample mean?
a. 0.6
b. 60
c. 36
d. 12

7.8 ANSWERS TO SELF-ASSESSMENT QUESTIONS


Q.1 b
Q.2 a
Q.3 a
Q.4 c
Q.5 b

UNIT 8
FUNDAMENTALS OF HYPOTHESIS TESTING: ONE-SAMPLE TESTS

STRUCTURE
8.0 Objectives
8.1 Introduction
8.2 Fundamentals of Hypothesis-Testing Methodology
8.3 t-Test of Hypothesis for the Mean (Sigma Unknown)
8.4 One-Tail Tests
8.5 Z Test of Hypothesis for the Proportion
8.6 Let Us Sum Up
8.7 Self-Assessment Questions
8.8 Answers to Self-Assessment Questions

8.0 OBJECTIVES
After reading this unit, you will be able to:
• understand concepts of hypothesis testing

• estimate the population mean when the population standard deviation is unknown

• conduct a one-tail test

• estimate a population proportion using the z statistic

8.1 INTRODUCTION

This unit will introduce you to the basic concepts of hypothesis testing, such as the null
hypothesis, alternative hypothesis, level of significance, type of test, etc. At the end of the
unit, you will be able to estimate the population mean when the population standard
deviation is unknown. The unit will also provide an understanding of one-tail tests in
different scenarios. It includes some exercises, which will allow you to observe different
situations in which hypothesis testing is conducted.

There are two ways to draw inferences about a population using sample data. The first is to
calculate a point estimate of a population parameter and then form a confidence interval
around this point estimate. The second applies when a researcher has a particular theory, or
hypothesis, that he or she would like to test. The hypothesis might be that a new machine is
working properly or requires replacement, that product sales in a new packaging design will
be more or less than with the current design, and so on. In the latter case, the researcher
collects relevant sample data and checks whether the data provide sufficient evidence to
support the hypothesis.

The hypothesis that the researcher is attempting to prove is called the alternative hypothesis
or research hypothesis. The opposite hypothesis, which maintains the status quo, is called
the null hypothesis.

8.2 FUNDAMENTALS OF HYPOTHESIS-TESTING METHODOLOGY

Hypothesis testing is fundamental to inferential statistics because it allows us to use
statistical methods to arrive at decisions on real-life problems using samples. There are
four conceptual steps involved in hypothesis testing:
1. Develop a research hypothesis that can be tested mathematically.
2. Formally state the null and alternative hypotheses.
3. Decide on an appropriate statistical test and perform the calculations.
4. Arrive at a decision based on the results.

Let’s take the example of a vending machine that is supposed to supply 100 ml of coffee per
cup. Suppose there is a complaint that the machine is supplying less than 100 ml. The
manufacturer wants to test this hypothesis. In this case, the null and alternative hypotheses
could be formally stated as:

H0: μ ≥ 100

H1: μ < 100

H0 is called the null hypothesis. In this example, the null hypothesis states that the vending
machine is supplying 100 ml or more. H1, sometimes written as Ha, is called the alternative
hypothesis: in this case, the alternative hypothesis is that the vending machine is supplying
less than 100 ml. Note that the null and alternative hypotheses must be both mutually
exclusive (no result could satisfy both conditions) and exhaustive (every possible result will
satisfy one of the two conditions). In this example, the alternative hypothesis is single-tailed:
we state that the vending machine is supplying less than 100 ml. Whether a test is
single-tailed or two-tailed depends on the alternative hypothesis.

We could also state a two-tailed alternative hypothesis if that is more appropriate to our
research objective. If the manufacturer is interested in whether the vending machine is
working properly, then the testing would consider both possibilities, higher or lower, and we
would state this using a two-tailed alternative hypothesis:

H0: μ = 100

H1: μ ≠ 100

Normally the first two steps would be performed before the experiment is designed or
the data collected; in such a situation, the statistic to be used for hypothesis testing is also
specified at this time or is implicit in the hypothesis and type of data involved. We then
collect the data and perform the statistical calculations, in this case probably a t-test, and
based on our results make one of two decisions:

• Reject the null hypothesis and accept the alternative hypothesis, or

• Fail to reject the null hypothesis.

The first case is sometimes called “finding significance” or “finding significant results.” The
process of statistical testing involves establishing a probability threshold, the level of
significance, against which the p-value (a topic treated at greater length below) is compared;
when the p-value falls below this threshold, we consider the results from our sample strong
enough to support the rejection of the null hypothesis.

In practice, the level of significance is commonly set at 0.05. Why this particular value? It is
an arbitrary cutoff point and dates back to the early twentieth century, when statistics were
computed by hand and the results compared to published tables to determine whether a
result was significant or not.
Lower cutoffs are sometimes used, such as p < 0.01 or p < 0.001; a higher cutoff, such as
p < 0.10, is less widely accepted. Note that failure to reject the null hypothesis does not
mean that we have proven it to be true, only that the experiment or study did not provide
sufficient evidence to reject it.

Inferential statistics allows us to make probabilistic statements about the data, but the
possibility of error is inherent in the process. Statisticians have classified two types of errors
when making decisions in inferential statistics and set levels for error rates that are
commonly considered acceptable.
The two types of error are known as Type I and Type II errors. In our professional and personal
lives, we often have to make accept-reject types of decisions based on incomplete data. As
long as such decisions are made based on evidence that does not provide 100% confidence,
there will be possibilities of errors. No error is committed when a good prospect is accepted
or a bad one is rejected. But there is always a possibility that a bad prospect is accepted or a
good one is rejected. Of course, we would like to minimize the possibilities of such errors.
During statistical hypothesis testing, rejecting a true null hypothesis is known as a Type I
error, and accepting a false null hypothesis is known as a Type II error.

Table 8.1: Type I and Type II Errors

Decision  | H0 True      | H0 False
Accept H0 | No Error     | Type II Error
Reject H0 | Type I Error | No Error

In hypothesis testing, the major task is to minimize the chances of Type I and Type II errors.
Unfortunately, for a given sample size it is not possible to minimize both errors at the same
time, because reducing one increases the other. Thus the probability of one of them (usually
the Type I error) is fixed in advance, and the other is then minimized.


In statistical hypothesis testing, one has to establish a significance level, denoted by α, and to
reject H0 when the p-value falls below it. The standard values are 10%, 5%, and 1%. Suppose
α is set at 5%. This means that whenever the p-value is less than 5%, H0 will be rejected.
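The meaning of the significance level as a Type I error rate can be illustrated by simulation. The sketch below is a hypothetical experiment, not part of the unit: it repeatedly draws samples from a population for which H0 is true (mean 100, standard deviation 5, both invented for illustration), applies a two-tail z test with known σ, and counts how often H0 is wrongly rejected at α = 0.05.

```python
import random
from statistics import NormalDist, mean

def z_test_two_tail_p(sample, mu0, sigma):
    """Two-tail p-value of a z test for the mean with known sigma."""
    z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)          # fixed seed so the run is repeatable
alpha = 0.05
trials = 4000
rejections = 0
for _ in range(trials):
    # H0 is true here: every sample comes from N(100, 5).
    sample = [random.gauss(100, 5) for _ in range(30)]
    if z_test_two_tail_p(sample, mu0=100, sigma=5) < alpha:
        rejections += 1  # a Type I error: a true H0 was rejected

type_i_rate = rejections / trials
print(type_i_rate)
```

The observed rejection rate should land close to 0.05, which is what fixing α at 5% means in practice.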

8.3 t-TEST OF HYPOTHESIS FOR THE MEAN (SIGMA UNKNOWN)


The significance test for the mean of a Normal population is based on estimating μ
through the sample mean x̄. The sampling distribution of x̄ depends on σ. If σ is known, the z
test for significance of the mean can be conducted. However, when σ is unknown, we need to
estimate the population standard deviation σ by the sample standard deviation s.

If we select a simple random sample (SRS) of size n from any normally distributed population
with mean μ and standard deviation σ, the sample mean x̄ follows a normal distribution with
mean μ and standard deviation σ/√n. When σ is not known, the sample standard
deviation s is used to estimate the standard deviation of x̄ by s/√n.

Then the one-sample t statistic

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

follows a t distribution with n − 1 degrees of freedom.

To illustrate the use of the t-test for the mean, let us assume that we have the sales
performance of a particular location for the last ten weeks, as given in Table 8.2. The
long-term average sales performance is $1,000, and the business objective is to determine
whether this average has been maintained over the past ten weeks. As a manager of the
company, you need to determine whether this amount has changed. In other words, the
hypothesis test is used to determine whether the average sales have moved away from
$1,000, whether by increasing or decreasing.
Table 8.2: Sales Performance of a Particular Location

Week No. | Sales Performance
1        | 990
2        | 1,050
3        | 950
4        | 975
5        | 1,025
6        | 1,075
7        | 975
8        | 999
9        | 1,000
10       | 1,100

Solution:
To perform this two-tail hypothesis test, follow the steps given below:

Step 1 Define the following hypotheses:

H0: μ = 1,000
H1: μ ≠ 1,000

The alternative hypothesis contains the statement to be proven. If the null
hypothesis is rejected, then there is statistical evidence that the population mean, i.e. the
average sales performance, is no longer $1,000. If the statistical conclusion is “do not reject
H0,” then you will conclude that there is insufficient evidence to prove that the mean amount
differs from the long-term mean of $1,000.

Step 2 Data are collected as a sample of n = 10 sales performance figures for the last ten
weeks. Suppose we decide to use α = 0.05 (level of significance).

Step 3 Because σ is unknown, we will use the t distribution and the t-STAT test statistic.
Because the sample size is only 10, using the t distribution requires the assumption that the
population of sales performance is approximately normally distributed.

Step 4 For a given sample size, n, the test statistic t-STAT follows a t distribution with n − 1
degrees of freedom. The critical values of the t distribution with 10 − 1 = 9 degrees of freedom
can be found from the statistical table. But since we are using Excel for testing the
hypothesis, only the p-value needs to be checked to decide whether to reject the null
hypothesis. α is already established at 0.05 (level of significance), so H0 is rejected when the
p-value falls below it. This means that whenever the p-value is less than 5%, H0 will be
rejected.
As a t-test for a single sample is not directly available in Excel, the t-Test for two samples
assuming unequal variances is used instead. The hypothesized average sales performance
(1,000) is entered as a second sample, with the same value in each of the cells C2:C11.
Fig. 8.1: t-Test for two-sample assuming unequal variances
[Excel dialog box "t-Test: Two-Sample Assuming Unequal Variances", shown beside the worksheet holding the Sales Performance values and the constant 1000 column; fields: Variable 1 Range, Variable 2 Range, Hypothesized Mean Difference, Labels, Alpha, Output options.]

This tool has to be loaded by going to File -> Options -> Add-Ins and then loading the
Analysis ToolPak.

“Data Analysis” will then be available under the “Data” tab; then use “t-Test: Two-Sample
Assuming Unequal Variances”.
Excel Output:

Fig. 8.2: Excel worksheet results for the Average Sales Performance example t-test

t-Test: Two-Sample Assuming Unequal Variances

                             | Sales Performance | Average Sales Performance
Mean                         | 1013.9            | 1000
Variance                     | 2296.544444       | 0
Observations                 | 10                | 10
Hypothesized Mean Difference | 0                 |
df                           | 9                 |
t Stat                       | 0.917228146       |
P(T<=t) one-tail             | 0.191472125       |
t Critical one-tail          | 1.833112933       |
P(T<=t) two-tail             | 0.382944249       |
t Critical two-tail          | 2.262157163       |

Interpretation:

From the Fig. 8.2 results, t-STAT = 0.917 and the two-tail p-value = 0.3829. Because the
p-value of 0.3829 is greater than α = 0.05, we do not reject H0. The data provide insufficient
evidence to conclude that the average sales performance differs from $1,000.
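The t statistic reported by Excel can be reproduced directly from the ten weekly figures in Table 8.2. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and the two-tail critical value is copied from the Excel output rather than computed.

```python
import math
from statistics import mean, stdev

# Weekly sales performance from Table 8.2.
sales = [990, 1050, 950, 975, 1025, 1075, 975, 999, 1000, 1100]
mu0 = 1000                      # hypothesized mean under H0

n = len(sales)
x_bar = mean(sales)             # sample mean
s = stdev(sales)                # sample standard deviation (n - 1 divisor)
t_stat = (x_bar - mu0) / (s / math.sqrt(n))

t_critical = 2.262157163        # two-tail critical value for df = 9, alpha = 0.05
reject_h0 = abs(t_stat) > t_critical
print(round(t_stat, 4), reject_h0)
```

Comparing |t| with the critical value is the table-based equivalent of comparing the p-value with α; both routes lead to the same decision here.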

8.4 ONE-TAIL TESTS
In the last section, we saw an example of a two-tail test, so called because the rejection
region is divided between the two tails of the sampling distribution of the mean. In the same
example discussed above, suppose our focus is on a particular direction. Suppose that the
manager is worried about the sales performance of employees and wants to arrange some
new training for employees only if the sample of the last ten weeks shows a decrease in
average sales performance.

Now, the objective is to determine whether the average sales performance of employees is
less than $1,000. To perform this one-tail hypothesis test, follow the steps given below:

Step 1 Define the null and alternative hypotheses:

H0: μ ≥ 1,000
H1: μ < 1,000

The alternative hypothesis contains the statement for which we are trying to find evidence. If
the conclusion of the test is “reject H0,” there is statistical evidence that the average sales
performance is less than $1,000. This would be a sufficient reason to arrange a new training
program for employees for better performance. If the conclusion of the test is “do not reject
H0,” then there is insufficient evidence that the average sales performance is less than
$1,000. If this occurs, there would be no reason to arrange training.

Step 2 We use the same data set, which is a sample of ten weeks (n = 10). We decide to use
α = 0.05.

Step 3 Because σ is unknown, we will use the t distribution and the t-STAT test statistic. We
are assuming that the sales performance data are normally distributed.
Step 4 In this case, the rejection region is entirely contained in the lower tail of the
sampling distribution of the mean: we want to reject H0 only when the sample mean is
significantly less than $1,000. Because the entire rejection region is contained in one tail of
the sampling distribution of the test statistic, the test is called a one-tail test, or directional
test. If the alternative hypothesis includes the less-than sign, the critical value of t is negative.
Finally, a decision is taken based on the p-value. α is already established at 0.05 (level of
significance), so H0 is rejected when the p-value falls below it. This means that whenever
the p-value is less than 5%, H0 will be rejected.

The one-tailed test is explained using the same example; thus, the same Excel output is
interpreted, this time considering the one-tail values.
Fig. 8.3: Excel worksheet results for the Average Sales Performance example t-test
(one-tailed test)

t-Test: Two-Sample Assuming Unequal Variances

                             | Sales Performance | Average Sales Performance
Mean                         | 1013.9            | 1000
Variance                     | 2296.544444       | 0
Observations                 | 10                | 10
Hypothesized Mean Difference | 0                 |
df                           | 9                 |
t Stat                       | 0.917228146       |
P(T<=t) one-tail             | 0.191472125       |
t Critical one-tail          | 1.833112933       |
P(T<=t) two-tail             | 0.382944249       |
t Critical two-tail          | 2.262157163       |

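Interpretation:

From the Fig. 8.3 results, t-STAT = 0.917 and Excel reports a one-tail p-value of 0.1914, which
is the area in the tail beyond the observed t-STAT. Because the sample mean (1,013.9) is
above $1,000, t-STAT is positive and lies on the opposite side from the lower-tail rejection
region, so the lower-tail p-value is 1 − 0.1914, or approximately 0.81. In either case the
p-value is greater than α = 0.05; thus, we do not reject H0. The data provide insufficient
evidence to conclude that the average sales performance is less than $1,000.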
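The one-tail and two-tail areas in the Excel output can be checked without a statistics package by integrating the t density numerically. This is a rough sketch under stated assumptions: the function names `t_pdf` and `t_cdf` are introduced here for illustration, and Simpson's rule is used in place of whatever method Excel uses internally.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] and the symmetry of the density."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area

t_stat, df = 0.917228146, 9            # values from the Excel output
p_lower = t_cdf(t_stat, df)            # area below t-STAT (lower tail)
p_upper = 1 - p_lower                  # area above t-STAT (upper tail)
p_two = 2 * min(p_lower, p_upper)      # two-tail p-value
print(round(p_lower, 4), round(p_upper, 4), round(p_two, 4))
```

The upper-tail area matches the "one-tail" figure in the worksheet, while the lower-tail area is the one relevant to an alternative hypothesis of the form μ < 1,000; neither is below 0.05.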

8.5 Z TEST OF HYPOTHESIS FOR THE PROPORTION


To test a population proportion p, select a random sample and compute the sample
proportion p̂ = x/n, which has a sampling distribution that is approximately normal when the
sample size is reasonably large. The standardized value

$$Z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$$

is approximately normal with mean 0 and standard deviation 1.

Let p0 be the borderline value of p between the null and alternative hypotheses. The
p-value of the test is found by seeing how much probability lies beyond the test statistic in
the tail (or tails) of the standard normal distribution.
A rule of thumb for checking the large-sample assumption of this test is to check whether
n p0 > 5 and n(1 − p0) > 5.

Then the test statistic for the test of proportion can be written as:

$$Z_{STAT} = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$

Illustration with an example using the Critical Value Approach

In a survey of 500 customers of Vodafone, 390 said that they are satisfied with the current
service. Suppose that a survey conducted in the previous year indicated that 80% of users
were satisfied. Is there evidence that the proportion of satisfied users has changed from the
previous year? To investigate this question, the null and alternative hypotheses are as
follows:

H0: p = 0.80 (i.e., the proportion of satisfied customers has not changed from the previous
year)

H1: p ≠ 0.80 (i.e., the proportion of satisfied customers has changed from the previous
year)
Since we are interested in determining whether the population proportion of satisfied
customers has changed from 0.80 in the previous year, we will use a two-tail test.
Suppose the level of significance is α = 0.05.

Thus, the decision rule is as follows:

Reject H0 if Z_STAT < −1.96 or Z_STAT > +1.96; otherwise, do not reject H0.

Fig. 8.4: Excel Z test worksheet results for whether the proportion of satisfied customers has
changed from 0.80
[Worksheet "Z Test of Hypothesis for the Proportion". Available information: null hypothesis p = 0.8; level of significance = 0.05; number of satisfied customers = 390; sample size = 500. Computed rows: standard error, Z test statistic, lower critical value, upper critical value, p-value.]

The NORM.S.INV functions in cells E12 and E13 determine the lower and upper critical values
shown in Fig. 8.4. The NORM.S.DIST function is used to calculate the p-value in cell E14.
As an alternative to the critical value approach, we can also use the p-value to decide
whether to reject the null hypothesis. For this two-tail test, in which the rejection
region is located in the lower tail and the upper tail, we find the area below a Z value
of −1.1180 and above a Z value of +1.1180. Fig. 8.4 reports a p-value of 0.2635. Because this
value is greater than the selected level of significance α = 0.05, we do not reject the
null hypothesis. Hence, there is no evidence of a significant change in the proportion of
satisfied customers from the previous year’s survey.
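The same Z test can be sketched in Python with the standard library's `statistics.NormalDist`, which plays the role of Excel's NORM.S.INV and NORM.S.DIST here. The variable names are illustrative assumptions; the inputs are those of the customer-satisfaction example.

```python
import math
from statistics import NormalDist

x, n = 390, 500          # satisfied customers and sample size
p0 = 0.80                # proportion under the null hypothesis
alpha = 0.05             # level of significance

p_hat = x / n                                   # sample proportion
std_error = math.sqrt(p0 * (1 - p0) / n)        # standard error under H0
z_stat = (p_hat - p0) / std_error               # Z test statistic

std_normal = NormalDist()                       # mean 0, standard deviation 1
z_crit = std_normal.inv_cdf(1 - alpha / 2)      # upper critical value; lower is -z_crit
p_value = 2 * std_normal.cdf(-abs(z_stat))      # two-tail p-value
reject_h0 = abs(z_stat) > z_crit
print(round(z_stat, 4), round(z_crit, 2), round(p_value, 4), reject_h0)
```

Both approaches agree: the test statistic of about −1.118 lies inside ±1.96, and the two-tail p-value of roughly 0.26 exceeds 0.05.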

8.6 LET US SUM UP

This chapter covers the fundamental concepts of hypothesis testing. A detailed discussion
of the formulation of the null hypothesis and alternative hypothesis in different situations
is provided. Testing of the mean when the standard deviation is known and when it is
unknown is covered, along with details of a few business applications. In this chapter, we
have discussed examples of the calculation of test statistics for the mean and the
proportion. Manual as well as Excel-based approaches are discussed with suitable
business-related examples.

8.7 SELF-ASSESSMENT QUESTIONS


Q.1 Select the statement which holds for a statistical hypothesis:

a. A statistical hypothesis is a generalized statement about sample statistics.

b. A statistical hypothesis is a generalized statement about the population
parameter.

c. A statistical hypothesis is any random statement made by a researcher.

d. A statistical hypothesis is a sample statement.

Q.2 In hypothesis testing we need to define first:

a. Level of significance

b. Sample size

c. Type of error

d. A null and alternative hypothesis

Q.3 In hypothesis testing, what is the assumption about the null hypothesis?

a. The null hypothesis is false.

b. The null hypothesis is true.

c. The null hypothesis cannot be rejected.

d. The null hypothesis is always true.
Q.4 In the following example, the null and alternative hypotheses are written as:

H0: μ ≤ 10

H1: μ > 10

Select a statement that holds.

a. Null and alternative hypotheses are the same.

b. Null and alternative hypotheses are mutually exclusive.

c. Null and alternative hypotheses are not right.

d. Null and alternative hypotheses cannot be tested.

Q.5 In the following example, the null and alternative hypotheses are written as:

H0: μ ≤ 10

H1: μ > 10

Select a statement that holds.

a. The test is a two-tail test

b. The test is a left-tail test

c. The test is a one-tail test

d. The test is a no-tail test

8.8 ANSWERS TO SELF-ASSESSMENT QUESTIONS


Q.1 b
Q.2 d
Q.3 b
Q.4 b
Q.5 c

TWO-SAMPLE TEST

STRUCTURE
9.0 Objectives
9.1 Introduction
9.2 Comparing the Mean of Two Independent Populations
9.3 Comparing the Mean of Two Related Populations
9.4 Comparing the Proportions of Two Independent Populations
9.5 Let Us Sum Up
9.6 Self-Assessment Questions
9.7 Answers to Self-Assessment Questions

9.0 OBJECTIVES

After reading this unit, you will be able to:


• test hypotheses about the difference in two means with known population standard
deviation

• test hypotheses about the difference in two means with unknown population standard
deviation

• calculate the Z test and t-test in the case of two dependent populations

• test hypotheses about differences in two population proportions

• test hypotheses about the average difference in two related populations

9.1 INTRODUCTION

This chapter introduces the student to analyzing data from two samples. The text is
prepared with the notion that the student should be able to test hypotheses in the case of
two populations. Simple data sets are used to demonstrate the solutions. The focus is also
on the use of a z statistic or a t statistic for analyzing the difference between two sample
means. The chapter also discusses how to choose between them depending on whether the
population standard deviation is known or unknown. The concept of pooled variance is also
discussed. The Z test for proportions and the t-test for independent and related samples are
discussed with the help of suitable examples.
UNIT 9
Two-Sample Test

Two-sample problems are among the most common situations encountered in statistical
practice. Sometimes researchers wish to compare two individuals, groups, or companies in
terms of their performance. In such a study, we need to compare two random samples, one
from each of two different populations.

In two-sample problems, the purpose is to compare the responses in two groups, where
each group is considered to be a sample from a distinct population and the responses in one
group are independent of those in the other.

Once we have two independent samples from two distinct populations, we measure the
same variable in both samples. We call the variable x1 in the first population and x2
in the second because the variable may have different distributions in the two populations.
Suppose x̄1 is the mean of an SRS of size n1 drawn from an N(μ1, σ1) population and x̄2 is the
mean of an independent SRS of size n2 drawn from an N(μ2, σ2) population. Then the
two-sample z statistic can be written as:

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{(\sigma_1^2/n_1) + (\sigma_2^2/n_2)}}$$

It has the standard Normal N(0, 1) sampling distribution. If the two population standard
deviations σ1 and σ2 are not known, we estimate them by the sample standard deviations s1
and s2 from our two samples.

Suppose that the two Normal population distributions have the same standard deviation σ.
Both sample variances s1² and s2² then estimate σ². The combined (pooled) estimate is a
weighted average of the two, where the weights are equal to their degrees of freedom. The
resulting estimator of σ² is:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
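As a quick illustration of the pooled estimator, the sketch below weights two sample variances by their degrees of freedom. The numbers are hypothetical and chosen only to make the arithmetic easy to follow; the function name `pooled_variance` is likewise an assumption introduced for this example.

```python
def pooled_variance(s1_sq, n1, s2_sq, n2):
    """Weighted average of two sample variances, weights = degrees of freedom."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

# Hypothetical inputs (not from the text): s1^2 = 4.0 with n1 = 10, s2^2 = 6.0 with n2 = 15.
sp_sq = pooled_variance(s1_sq=4.0, n1=10, s2_sq=6.0, n2=15)
print(round(sp_sq, 4))  # (9*4 + 14*6) / 23

# With equal sample sizes the pooled variance is the simple average of the two.
sp_sq_equal_n = pooled_variance(s1_sq=4.0, n1=10, s2_sq=6.0, n2=10)
```

The larger sample pulls the pooled value towards its own variance, which is why the first result sits closer to 6.0 than to 4.0.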

9.2 COMPARING THE MEAN OF TWO INDEPENDENT POPULATIONS


Generally, the two population standard deviations σ1 and σ2 are unknown. We
estimate them by the sample standard deviations s1 and s2 from our two samples and
substitute the standard errors for the standard deviations in the two-sample z statistic.
The result is the two-sample t-statistic:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{(s_1^2/n_1) + (s_2^2/n_2)}}$$

In this case, to test the hypothesis H0: μ1 = μ2 (assume a 0.05 level of significance), we use
p-values or critical values for the t(k) distribution, where the degrees of freedom k are either
approximated by software or taken as the smaller of n1 − 1 and n2 − 1.


Example:

Assume that recently released movies were randomly selected and rated by two different
age groups of viewers. Assuming the ratings are normally distributed, conduct a
statistical test to determine whether there is a significant difference between the average
ratings given by the two groups.
Table 9.1: Information Table

Movies Rating (Scale: 0 to 100)

Movie No. | Group 1 | Group 2
1         | 90      | 35
2         | 81      | 33
3         | 61      | 10
4         | 54      | 50
5         | 60      | 48
6         | 73      | 73
7         | 44      | 33
8         | 30      | 11
9         | 25      | 58
10        | 38      | 18
11        | 52      | 12
12        | 32      | 61
13        | 16      | 96
14        | 8       | 94
15        | 18      | 80
Solution:

Using the Data Analysis ToolPak in Excel (loading it is explained in the previous unit), the
t-Test for two samples assuming unequal variances is selected, and the data are entered as
shown in Fig. 9.1.

Fig. 9.1: Selection of t-Test for Two-Sample Assuming Unequal variances

Fig. 9.2: Excel output of t-Test for Two-Sample Assuming Unequal variances

t-Test: Two-Sample Assuming Unequal Variances

                             | Group 1      | Group 2
Mean                         | 45.46666667  | 47.46667
Variance                     | 606.8380952  | 854.6952
Observations                 | 15           | 15
Hypothesized Mean Difference | 0            |
df                           | 27           |
t Stat                       | -0.202614346 |
P(T<=t) one-tail             | 0.420477582  |
t Critical one-tail          | 1.703288446  |
P(T<=t) two-tail             | 0.840955165  |
t Critical two-tail          | 2.051830516  |

Interpretation:
From the Fig. 9.2 results, t-STAT = −0.2026 and the two-tail p-value = 0.8409. Because the
p-value of 0.8409 is greater than α = 0.05, we do not reject H0. The data provide insufficient
evidence to conclude that there is a significant difference between the average ratings given
by the two groups.
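The t statistic and degrees of freedom in the Excel output can be reproduced from the ratings in Table 9.1. The sketch below is an illustration using only the standard library; the function name `welch_t` is an assumption, and the degrees of freedom use the usual software approximation for unequal variances (often called the Welch-Satterthwaite formula), which Excel then rounds to a whole number.

```python
import math
from statistics import mean, variance

# Ratings from Table 9.1.
group1 = [90, 81, 61, 54, 60, 73, 44, 30, 25, 38, 52, 32, 16, 8, 18]
group2 = [35, 33, 10, 50, 48, 73, 33, 11, 58, 18, 12, 61, 96, 94, 80]

def welch_t(a, b):
    """Two-sample t statistic and approximate df, not assuming equal variances."""
    va = variance(a) / len(a)          # s1^2 / n1
    vb = variance(b) / len(b)          # s2^2 / n2
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(group1, group2)
t_critical = 2.051830516               # two-tail critical value from the Excel output
reject_h0 = abs(t_stat) > t_critical
print(round(t_stat, 4), round(df, 2), reject_h0)
```

Because |t| is far below the two-tail critical value, the decision agrees with the p-value approach used in the interpretation.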

9.3 COMPARING THE MEAN OF TWO RELATED POPULATIONS


When items or individuals are paired together according to some characteristic of interest,
they are called related populations. Examples include performance before and after the
training of employees, an advertising campaign, a policy change, etc. Suppose X11, X12, ...,
X1n represent the n values from the first sample, and let X21, X22, ..., X2n represent either
the corresponding n matched values from a second sample or the corresponding n repeated
measurements from the initial sample. Then D1, D2, ..., Dn will represent the corresponding
set of n difference scores such that D1 = X11 − X21, D2 = X12 − X22, ..., and Dn = X1n − X2n. To
test for the mean difference between two related populations, one can treat the difference
scores, each Di, as values from a single sample.

Assuming that the difference scores are randomly and independently selected from a
population that is normally distributed, we can use the paired t-test for the mean difference
in related populations to determine whether there is a significant population mean
difference. To test the null hypothesis that there is no difference in the means of two related
populations:

H0: μD = 0

H1: μD ≠ 0

Suppose the level of significance is α = 0.05. The t-STAT test statistic for the paired t-test can
be written as

$$t_{STAT} = \frac{\bar{D} - \mu_D}{S_D / \sqrt{n}}$$

where μD is the hypothesized mean difference, D̄ is the average of the differences, and S_D is
the standard deviation of the differences. The t-STAT test statistic follows a t distribution
with n − 1 degrees of freedom.

Example:
ABC Network, Inc., wants to investigate the impact of its marketing campaign conducted at
the beginning of this year. Before expanding their business, network managers wanted to
test whether this marketing campaign changed sales on average. A random sample of sales
of 10 products before and after the marketing campaign is collected. The paired
observations for each product are given in Table 9.2. Based on these data, network
managers want to test the null hypothesis that there is no significant change in sales before
and after the marketing campaign versus the alternative hypothesis that there is. The
following solution to this problem introduces a paired-observation t-test.

Table 9.2: Total Sales of 10 Products Before and After Marketing Campaign

Products | Before Sales (in $1,000) | After Sales (in $1,000)
1        | 125                      | 150
2        | 540                      | 520
3        | 100                      | 95
4        | 200                      | 212
5        | 900                      | 800
6        | 265                      | 300
7        | 90                       | 110
8        | 206                      | 129
9        | 489                      | 540
10       | 590                      | 610

Fig. 9.2.1: Selection of t-Test Paired Two-Sample for Means
[Excel dialog box "t-Test: Paired Two Sample for Means", showing the input ranges for the two variables and the output options.]
Fig. 9.2.2: Excel output of t-Test Paired Two Sample for Means

t-Test: Paired Two Sample for Means

                             | Before Sales (in $1000) | After Sales (in $1000)
Mean                         | 350.5                   | 346.6
Variance                     | 71856.05556             | 63110.48889
Observations                 | 10                      | 10
Pearson Correlation          | 0.98428895              |
Hypothesized Mean Difference | 0                       |
df                           | 9                       |
t Stat                       | 0.251761984             |
P(T<=t) one-tail             | 0.403439888             |
t Critical one-tail          | 1.833112933             |
P(T<=t) two-tail             | 0.806879775             |
t Critical two-tail          | 2.262157163             |

Interpretation:
Variances are squares of standard deviations.

From the Fig. 9.2.2 results, t-STAT = 0.2517 and the two-tail p-value = 0.8068. Because the
p-value of 0.8068 is greater than α = 0.05, we do not reject H0 (i.e., H0: μD = 0 against
H1: μD ≠ 0). The data provide insufficient evidence against the null hypothesis. In other
words, there is no significant difference between the sales of the 10 products before and
after the marketing campaign.
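The paired t statistic can be reproduced by treating the ten differences as a single sample. This is a minimal sketch with illustrative variable names. One assumption should be flagged: the Before values for products 3 and 7 are taken as 100 and 90, the values that reproduce the Before mean (350.5), Before variance, and t Stat shown in the Excel output.

```python
import math
from statistics import mean, stdev

# Sales in $1,000. Before values for products 3 and 7 are assumed to be 100 and 90,
# the figures consistent with the Excel output (Before mean = 350.5).
before = [125, 540, 100, 200, 900, 265, 90, 206, 489, 590]
after = [150, 520, 95, 212, 800, 300, 110, 129, 540, 610]

diffs = [b - a for b, a in zip(before, after)]   # D_i = Before - After
n = len(diffs)
d_bar = mean(diffs)                              # average difference
s_d = stdev(diffs)                               # standard deviation of differences
mu_d = 0                                         # hypothesized mean difference
t_stat = (d_bar - mu_d) / (s_d / math.sqrt(n))

t_critical = 2.262157163                         # two-tail value for df = 9, alpha = 0.05
reject_h0 = abs(t_stat) > t_critical
print(d_bar, round(t_stat, 4), reject_h0)
```

Working with the differences reduces the two related samples to the one-sample t-test of the previous unit, which is why the degrees of freedom are n − 1 = 9.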

9.4 COMPARING THE PROPORTIONS OF TWO INDEPENDENT POPULATIONS


For large sample sizes, the distribution of each sample proportion is approximated
by a normal distribution, and the difference between the two sample proportions is also
approximately normally distributed. This leads to a test for the equality of two population
proportions based on the standard normal distribution.

The test statistic for the difference between two population proportions, where the null
hypothesis difference is zero, is given as

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$$

where $\hat{p}_1 = x_1/n_1$ is the sample proportion in sample 1 and $\hat{p}_2 = x_2/n_2$ is the sample
proportion in sample 2. The symbol $\hat{p}$ stands for the combined sample proportion in both
samples, considered as a single sample. That is:

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$$

When the null hypothesis is that the difference between the two population proportions is a
number other than zero, we cannot assume that $\hat{p}_1$ and $\hat{p}_2$ are estimates of the same
population proportion, and it is not possible to pool the two estimates. In such a situation,
the test statistic for the difference between two population proportions, when the null
hypothesis difference between the two proportions is some number D other than zero, is
given below:

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - D}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}$$

To illustrate the use of the Z test for the equality of two proportions, consider a management
institute whose management wishes to compare two online platforms and select the more
user-friendly one for conducting online evaluation of the students. Management decided to
run a test exam on both platforms at different points in time. During the exam on Platform 1,
a total of 200 queries were raised, of which 30 were technical queries about the platform.
Similarly, when the exam was conducted on Platform 2, 175 queries were raised, of which 45
were technical queries. At the 0.05 level of significance, is there evidence of a significant
difference between the two platforms in terms of user-friendliness?
Fig. 9.3: Excel Z test worksheet results for the difference between two proportions
for the two platforms

[Worksheet: Z Test for Differences in Two Proportions. Available information: hypothesized difference, level of significance, and the number of technical queries and total queries for Platform 1 and Platform 2. Intermediate calculations: sample proportions for Platform 1 and Platform 2, difference in two proportions, Z test statistic, lower critical value, upper critical value, and p-value.]

In this worksheet, the NORM.S.INV function is used to compute the lower and upper
critical values.
The NORM.S.DIST function is used to compute the p-value.
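
For readers working outside Excel, the critical values can be obtained in a similar way. The following is a minimal sketch, assuming Python's standard-library `statistics.NormalDist` as a stand-in for the inverse standard normal function; it is not part of the worksheet in Fig. 9.3.

```python
from statistics import NormalDist

alpha = 0.05                 # level of significance used in the example
std_normal = NormalDist()    # standard normal distribution (mean 0, sd 1)

# Two-tail test: alpha / 2 of the area lies in each tail
lower_critical = std_normal.inv_cdf(alpha / 2)
upper_critical = std_normal.inv_cdf(1 - alpha / 2)

print(f"Lower critical value = {lower_critical:.4f}")
print(f"Upper critical value = {upper_critical:.4f}")
```

The null hypothesis is rejected when the Z test statistic falls below the lower critical value or above the upper critical value.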

As an alternative to the critical value approach, we can also use the p-value to decide
whether to reject the null hypothesis. For this two-tail test, in which the rejection
region is located in the lower tail and the upper tail, we find the area below a Z value
of -2.5877 and above a Z value of +2.5877. Fig. 9.3 reports a p-value of 0.0096. Because this
value is less than the selected level of significance α = 0.05, we reject the null
hypothesis. Hence, there is a significant difference between the two platforms in terms of
user-friendliness.
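
The Z test statistic and the two-tail p-value for the platform example can be recomputed from the query counts given above. This is a sketch in Python rather than the Excel worksheet; the variable names are illustrative, and the last decimal of the p-value may differ slightly from the worksheet because of rounding.

```python
import math
from statistics import NormalDist

# Data from the example: technical queries (x) out of total queries (n)
x1, n1 = 30, 200    # Platform 1
x2, n2 = 45, 175    # Platform 2
alpha = 0.05

p1 = x1 / n1                          # sample proportion, Platform 1
p2 = x2 / n2                          # sample proportion, Platform 2
p_pooled = (x1 + x2) / (n1 + n2)      # combined (pooled) sample proportion

# Z statistic for H0: no difference between the two population proportions
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z_stat = (p1 - p2 - 0) / se

# Two-tail p-value: area below -|z| plus area above +|z|
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
reject_h0 = p_value < alpha

print(f"Z = {z_stat:.4f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```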

9.5 LET US SUM UP

In real-life situations, we often need to compare two populations. This may be done
through the comparison of independent or related samples. There are also instances
when the analysis concerns a given characteristic, and the comparison of
proportions comes into the picture. In this chapter, a detailed procedure and application of
these techniques are provided with appropriate examples. The two populations are studied
by comparing two sample means. The use of the z test and the t-test is discussed. Assumptions
of normality, the concept of unknown and known population standard deviation,
independent samples, related samples, and pooled variance are introduced to the student for
analysis of a given situation. Solutions are obtained manually as well as through Excel.

9.6 SELF-ASSESSMENT QUESTIONS


Q.1 In hypothesis testing, when comparing the means of two random samples to determine
whether they come from two different populations or the same one, the null hypothesis
can be framed as:
a Mlb

b, MiSfe
c. M#pe
d. HiT#e
Q.2 In hypothesis testing, when comparing the means of two random samples to determine
whether they come from two different populations or the same one, the alternate hypothesis
can be framed as:
a. Mize
b. MiSfe
c. Fb
d. Him #2


Q.3 In hypothesis testing, a comparison of means of two random samples is conducted to
determine whether both come from two different populations or the same. Suppose the
standard deviations σ1 and σ2 are known and α = 0.01; then what could be the critical z values?
a. -2.58 and 2.58
b. -1.96 and 1.96
c. -1.28 and 1.28
d. -2.33 and 2.33

Q.4 In hypothesis testing, a comparison of means of two random samples is conducted to
determine whether both come from two different populations or the same. Suppose the
standard deviations σ1 and σ2 are known and α = 0.05; then what could be the critical z values?
a. -2.58 and 2.58
b. -1.96 and 1.96
c. -1.645 and 1.645
d. -2.33 and 2.33

Q.5 In hypothesis testing, a comparison of means of two random samples is conducted to
determine whether both come from two different populations or the same. Suppose the
standard deviations σ1 and σ2 are known and α = 0.10; then what could be the critical z values?
a. -2.58 and 2.58
b. -1.96 and 1.96
c. -1.645 and 1.645
d. -2.33 and 2.33

9.7 ANSWERS TO SELF-ASSESSMENT QUESTIONS

Q.1 d

Q.2
Q.3

Q4

Q5


ANALYSIS OF VARIANCE

STRUCTURE
10.0 Objectives
10.1 Introduction
10.2 Applications of ANOVA
10.3 The Completely Randomized Design: One-Way ANOVA
10.4 The Factorial Design: Two-way ANOVA
10.5 The Randomized Block Design
10.6 Let Us Sum Up
10.7 Key Words
10.8 References and Suggested Additional Readings
10.9 Self-Assessment Questions
10.10 Check Your Progress — Possible Answers
10.11 Answers to Self-Assessment Questions

10.0 OBJECTIVES
After completing this unit, you will be able to understand:

• the meaning and concept of analysis of variance
• comparison of means of more than two groups (or populations)
• applications of analysis of variance
• types of analysis of variance and situations of using them
• concept of completely randomized design (one-way ANOVA)
• concept of randomized block design (two-way ANOVA)
• partitioning the sources of variation
• concept of Sum of Squares
• calculation of Mean Sum of Squares
• hypothesis testing of more than two groups (or populations)


10.1 INTRODUCTION

In the last chapter, we compared two populations using a t-test. There are many
situations in practical life when we are required to compare more than two populations. The
use of a t-test is limited in such situations. The technique used to compare more than two
populations or subgroups of a population is called ANOVA (an acronym for Analysis of
Variance). ANOVA is effectively used to compare populations each containing several levels
or subgroups. Since ANOVA is, at its core, a ratio of variances, it uses the F
distribution to test hypotheses.

ANOVA comes in many forms. One-way ANOVA (single-factor ANOVA) has only one factor,
and several groups (or levels) of the population are compared. A study conducted to
compare the mileage of three models of car would involve a one-way ANOVA, since it has
only one factor, i.e. car model, with three groups or levels (three models of the car). In the
experimental design context, one-way ANOVA is called a completely randomized design.
In a factorial design, more than one factor can be studied simultaneously in a single
experiment. In a two-way ANOVA design, there is a second factor that also accounts for
variation across the various groups (or populations). If the three car models are further grouped
into petrol and diesel variants to compare mileage, the design is called two-way
ANOVA. The fuel used acts as the second factor with two levels (petrol and diesel).
Two-way ANOVA has two forms: two-factor without replication and two-factor with
replication. Two-factor without replication is also called the randomized block design.

10.2 APPLICATIONS OF ANOVA

Irrespective of discipline, ANOVA can be used in every situation where a comparison is being
made between more than two groups or populations. It provides a robust procedure for
hypothesis testing in such situations. An illustrative list of its applications in the field of
business and management is given below:

• Comparison of the effectiveness of various training programs.
• Comparison of equity prices among different types of markets.
• Examining the difference in the rate of return across various types of investments.
• Determining whether there is legitimate variation in the demand for a product across
  various price points.
• Finding out whether there is a difference among various Chinese snacks based on
  the cooking time.

10.3 THE COMPLETELY RANDOMIZED DESIGN: ONE-WAY ANOVA

In One-way ANOVA there is only one factor that is used to categorize the data in various
groups (populations or samples). This single factor has many levels or groups. These levels
can be categorical or numerical. Let’s illustrate this with an example.

Example 10.1: A car rating agency wants to compare the mileage of three car models,
belonging to a particular car segment, manufactured by an automobile company. It collects
mileage data from a sample of 6 cars in each model category to check whether the mileage
of the different car models differs. The following mileage data (in km) were collected for 18
cars of the same age, driven under the same conditions, to control for
variation due to these factors:
Table 10.1

Model A | Model B | Model C
15 | 8 | 12
18 | 7 | 19
17 | 10 | 18
19 | 15 | 12
19 | 14 | 17
20 | 14 | 14

The comparison of the mileage of three car models requires one-way ANOVA, as there is a
single factor (car model) with three levels (Model A, Model B, and Model C). These levels can
be categorical or numerical. In our example, the three levels (Model A, Model B, and Model C)
are categorical. If we categorized the cars on a factor like price, with levels of 5
Lakhs, 7 Lakhs, 10 Lakhs, 12 Lakhs, etc., then the levels would be called numerical.
One-way ANOVA is also known as single-factor ANOVA. In the experimental design context, it
is called a completely randomized design because units are randomly selected and assigned
to groups. In Example 10.1, the cars of each model have been selected randomly to make it
a completely randomized design.

10.3.1 COMPARISON BETWEEN GROUPS

Analysis of variance (ANOVA) procedure requires comparing the means of the groups. If the
means of the groups are not different from each other, the groups can be considered as part
of the same larger population, and the null hypothesis, ‘all population (group) means are
equal’, fails to get rejected. But, if means are significantly apart, the groups are considered
belonging to different populations, and the alternate hypothesis, ‘all population (group)
means are not equal’, gets accepted.

When there are three populations (or groups), the ANOVA procedure requires comparing the
means of three samples to see if a meaningful difference exists among them. In other words,
we measure the relative distance between the three sample means. To measure the
relative distance between the three sample means, each sample mean is compared with the
overall mean (the mean of the population in the background). The following three situations
may arise:

(i) If all sample means are equal, these samples come from the same population. We
conclude that there is no difference between the three populations. Three
different populations don’t exist in such a situation and the null hypothesis
(H0: μ1 = μ2 = μ3) fails to get rejected.
(ii) If one of the means is far away from the other two, it is likely that it is not from the
same population. In such a case, two of the three samples belong to a common
population but the third sample belongs to a different population. Since the
mean value of all samples is not the same, we reject the null hypothesis
(H0: μ1 = μ2 = μ3) and accept the alternate hypothesis (all population means are
not equal).
(iii) If all three means are far apart, they all come from unique populations. In this case,
too, the alternate hypothesis (that all population means are not equal) gets
accepted.
To test our null hypothesis H0: μ1 = μ2 = μ3, we are not checking if the sample means are exactly
equal. We are checking whether each sample comes from the same larger overall population or
not.

CHECK YOUR PROGRESS - I

Q.1 One-way ANOVA can be used to compare the means of four groups. (True/False)

Q.2 Analysis of variance is:
(a) Ratio of means
(b) Ratio of variances
(c) Product of variances

Q.3 Factor levels can be ________ or ________.

10.3.2 VARIANCES OF COMPLETELY RANDOMIZED DESIGN

Each set of data (group) is a distribution in itself with its own mean and variance (since all
values in the data set are not the same as the mean of the data set). The idea of variability is
of particular importance in ANOVA. In Example 10.1, we are checking whether there is a
significant variation between the mileage of cars. Some proportion of the variation in mileage
will be due to differences in models. Models may differ in gross weight, engine capacity,
number of cylinders, etc. This proportion of variance is called explained variation or direct
variation because we want to compare the mileage based on car models. This variation has
been accounted for. The other proportion of variation in mileage will be due to reasons other
than the difference in car models. Such factors are not accounted for in our study. This
proportion of variance is called unexplained variation or error variation. Though we are
trying to control some factors like age and driving conditions, there may still be some factors,
such as driving habits and driving hours, that we have not considered but that can cause
error variation to come into the picture. Thus, the total variation is composed of two parts:
the explained variation and the error variation.

10.3.3 TYPES OF VARIATION


The two parts of variation in one-way ANOVA are, first, the variation among groups and,
second, the variation within groups. The first type is called between-columns
variation (among-groups variation) and the second type is called within-groups
variation (error variation). The first is the variance of each sample mean from the overall larger
population mean, and the second is the variance or spread of each distribution.

10.3.3.1 TOTAL VARIATION

The two variations (between columns and within groups) add up to total variation. It is the
variance of all data values from the overall population mean.

10.3.3.2 BETWEEN COLUMNS VARIATION

It is the variance of sample means from the overall population mean. It comes into the
picture due to the difference in columns or groups.

10.3.3.3 WITHIN GROUPS VARIATION


It is the spread of each distribution. It comes into the picture due to the differences in values
within each group (distribution).

Total Variation = Between-columns variation (explained variation) + Within-groups
variation (unexplained variation)

ANOVA, at its core, is a variability ratio. It is the ratio of between-columns variation to
within-groups variation:

$$\text{Variability ratio} = \frac{\text{Between-columns variation}}{\text{Within-groups variation}}$$

If the between-columns variation is sufficiently larger than the within-groups variation, the
ratio will be well above 1. In such a situation the samples, in all likelihood, do not originate
from a common population, and the null hypothesis, ‘all means are equal’, gets rejected.
Otherwise, the null hypothesis fails to get rejected.

The above variations are represented by the Sum of Squares (SS).

10.3.4 SUM OF SQUARES
Variance is the average of the squared deviations from the mean. The expressions for
variance are:

Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N}$$

In the above expressions, the numerator, the sum of squared deviations from the mean, is the
sum of squares (SS). If we divide SS by N or by n-1, we get the variance of a data set.
Therefore, SS is the variance before averaging: the sum of squared deviations without
division by the number of values or the degrees of freedom.
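
The relationship between SS and variance can be illustrated with the Model A mileages from Table 10.1. The snippet below is a small Python sketch (the text itself does not use Python); it computes SS once and divides it by n-1 and by n, and uses the standard-library `statistics` module only as a cross-check.

```python
import statistics

model_a = [15, 18, 17, 19, 19, 20]   # Model A mileages from Table 10.1
n = len(model_a)
mean_a = sum(model_a) / n

# Sum of squares: squared deviations from the mean, not yet averaged
ss = sum((x - mean_a) ** 2 for x in model_a)

sample_variance = ss / (n - 1)       # divide by degrees of freedom
population_variance = ss / n         # divide by number of values

print(f"SS = {ss}, sample variance = {sample_variance:.2f}, "
      f"population variance = {population_variance:.2f}")

# Cross-check against the library implementations
print(statistics.variance(model_a), statistics.pvariance(model_a))
```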
The concept of the sum of squares (SS) is fundamental to variation and hence ANOVA. Each
type of variation is represented by its corresponding sum of squares (SS). Total variation is
represented by the sum of squares total (SST), between-columns variations are represented
by the sum of squares column (SSC), and within-group variation is represented by the sum of
squares within (SSW). Each SS in ANOVA has its degrees of freedom.

10.3.4.1 SUM OF SQUARES TOTAL (SST)

It is the sum of squared deviations between each observation in the data and the grand mean
(mean of all values in the data). For N values in the data, there are N-1 degrees of freedom
associated with SST.

10.3.4.2 SUM OF SQUARES COLUMN (SSC)


Also known as the sum of squares among groups, it is the sum of squared deviations
between the sample mean of each group and the grand mean, weighted by the sample size
in each group. For C groups, there are C-1 degrees of freedom associated with SSC.

10.3.4.3 SUM OF SQUARES WITHIN (SSW)


Also known as the error sum of squares, it is the sum of squared deviations between each value of
a distribution and the mean of that distribution, summed over all groups. Because
each of the C groups contributes $n_i - 1$ degrees of freedom, there are N-C degrees of freedom
associated with SSW.


Table 10.2: Sum of Squares

Type of SS | Expression | Terms Used | Degrees of Freedom (df)
SST | $\sum_{i=1}^{C}\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_g)^2$ | where $X_{ij}$ is an individual observation, $\bar{X}_g$ is the grand mean, and N is the total number of values in the data. | N-1
SSC | $\sum_{i=1}^{C} n_i (\bar{X}_i - \bar{X}_g)^2$ | where $\bar{X}_i$ is the mean of a distribution, $\bar{X}_g$ is the grand mean, $n_i$ is the number of values in a group, and C is the number of groups. | C-1
SSW | $\sum_{i=1}^{C}\sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2$ | where $X_{ij}$ is an individual value in the distribution, $\bar{X}_i$ is the mean of a distribution, $n_i$ is the number of values in the distribution, C is the number of groups, and N is the total number of values in the data. | N-C

The sum of squares total (SST) is partitioned into two parts: the sum of squares among (SSC)
and the sum of squares within (SSW), i.e. SST = SSC + SSW
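
The partition SST = SSC + SSW can be checked numerically on the data of Table 10.1. The following Python sketch applies the three definitions directly; the dictionary layout and variable names are illustrative choices, not notation from the text.

```python
# Mileage data from Table 10.1, one list per group (column)
groups = {
    "Model A": [15, 18, 17, 19, 19, 20],
    "Model B": [8, 7, 10, 15, 14, 14],
    "Model C": [12, 19, 18, 12, 17, 14],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# SST: every observation against the grand mean
sst = sum((x - grand_mean) ** 2 for x in all_values)

# SSC: each group mean against the grand mean, weighted by group size
ssc = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values())

# SSW: every observation against the mean of its own group
ssw = sum((x - sum(v) / len(v)) ** 2 for v in groups.values() for x in v)

print(f"SST = {sst:.2f}, SSC = {ssc:.2f}, SSW = {ssw:.2f}")
print(f"SSC + SSW = {ssc + ssw:.2f}")
```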

10.3.5 MEAN SUM OF SQUARES

Since variance is the mean of squared deviations (sum of squares divided by the number of
observations in case of population or degrees of freedom in case of the sample), the mean
sum of squares is integral to the calculation of variances in ANOVA.

The mean sum of squares corresponding to SST is the mean sum of squares total (MST), the
mean sum of squares corresponding to SSC is the mean sum of squares column (MSC), and
the mean sum of squares corresponding to SSW is the mean sum of squares within (MSW).

Mean Sum of Squares (MS) = sum of squares (SS) / degrees of freedom, therefore:

$$MST = \frac{SST}{N-1} \qquad \text{(Equation 10.1)}$$

$$MSC = \frac{SSC}{C-1} \qquad \text{(Equation 10.2)}$$

$$MSW = \frac{SSW}{N-C} \qquad \text{(Equation 10.3)}$$

10.3.6 HYPOTHESIS TESTING


Although we want to compare the means of C groups to determine whether a difference
exists among them, the name ANOVA comes from the fact that we are comparing variances.
If the null hypothesis is true and there are no differences in the C group means, all three
mean sums of squares (or variances), MSC, MSW, and MST, provide estimates of the overall
variance in the data. Thus, testing the null hypothesis

$$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_c$$

against the alternative

$$H_1: \text{Not all } \mu_j \text{ are equal (where } j = 1, 2, \dots, c)$$

requires the F statistic (because we are testing a hypothesis for more than two populations),
which is the ratio of MSC to MSW.

$$F_{STAT} = \frac{MSC}{MSW} \qquad \text{(Equation 10.4)}$$

Since $F_{STAT}$ is the ratio of two variances, it follows the F distribution, with C-1 degrees of
freedom in the numerator and N-C degrees of freedom in the denominator.

For a given level of significance, α, we reject the null hypothesis if the calculated $F_{STAT}$ value is
greater than the upper-tail critical value, $F_{critical}$, from the F distribution with C-1 degrees of
freedom in the numerator and N-C degrees of freedom in the denominator.

Thus the decision rule is: Reject H0 if $F_{STAT} > F_{critical}$; otherwise do not reject H0.

If the null hypothesis is true, the calculated F statistic is expected to be approximately
equal to 1. If the null hypothesis is false, the F statistic is expected to be larger than 1.

Fig. 10.1: F PDF (One-Sided Test at Alpha = 0.05). The upper-tail rejection region (α = 0.05) lies to the right of the critical value 2.98.

Upper 5% points of the F distribution (columns: numerator degrees of freedom ν1; rows: denominator degrees of freedom ν2)

ν2 \ ν1 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 12 | 15 | 20 | 24 | 30 | 40 | 60 | 120 | ∞
1 | 161.4 | 199.5 | 215.7 | 224.6 | 230.2 | 234.0 | 236.8 | 238.9 | 240.5 | 241.9 | 243.9 | 245.9 | 248.0 | 249.1 | 250.1 | 251.1 | 252.2 | 253.3 | 254.3
2 | 18.51 | 19.00 | 19.16 | 19.25 | 19.30 | 19.33 | 19.35 | 19.37 | 19.38 | 19.40 | 19.41 | 19.43 | 19.45 | 19.45 | 19.46 | 19.47 | 19.48 | 19.49 | 19.50
3 | 10.13 | 9.55 | 9.28 | 9.12 | 9.01 | 8.94 | 8.89 | 8.85 | 8.81 | 8.79 | 8.74 | 8.70 | 8.66 | 8.64 | 8.62 | 8.59 | 8.57 | 8.55 | 8.53
4 | 7.71 | 6.94 | 6.59 | 6.39 | 6.26 | 6.16 | 6.09 | 6.04 | 6.00 | 5.96 | 5.91 | 5.86 | 5.80 | 5.77 | 5.75 | 5.72 | 5.69 | 5.66 | 5.63
5 | 6.61 | 5.79 | 5.41 | 5.19 | 5.05 | 4.95 | 4.88 | 4.82 | 4.77 | 4.74 | 4.68 | 4.62 | 4.56 | 4.53 | 4.50 | 4.46 | 4.43 | 4.40 | 4.36
6 | 5.99 | 5.14 | 4.76 | 4.53 | 4.39 | 4.28 | 4.21 | 4.15 | 4.10 | 4.06 | 4.00 | 3.94 | 3.87 | 3.84 | 3.81 | 3.77 | 3.74 | 3.70 | 3.67
7 | 5.59 | 4.74 | 4.35 | 4.12 | 3.97 | 3.87 | 3.79 | 3.73 | 3.68 | 3.64 | 3.57 | 3.51 | 3.44 | 3.41 | 3.38 | 3.34 | 3.30 | 3.27 | 3.23
8 | 5.32 | 4.46 | 4.07 | 3.84 | 3.69 | 3.58 | 3.50 | 3.44 | 3.39 | 3.35 | 3.28 | 3.22 | 3.15 | 3.12 | 3.08 | 3.04 | 3.01 | 2.97 | 2.93
9 | 5.12 | 4.26 | 3.86 | 3.63 | 3.48 | 3.37 | 3.29 | 3.23 | 3.18 | 3.14 | 3.07 | 3.01 | 2.94 | 2.90 | 2.86 | 2.83 | 2.79 | 2.75 | 2.71
10 | 4.96 | 4.10 | 3.71 | 3.48 | 3.33 | 3.22 | 3.14 | 3.07 | 3.02 | 2.98 | 2.91 | 2.85 | 2.77 | 2.74 | 2.70 | 2.66 | 2.62 | 2.58 | 2.54
11 | 4.84 | 3.98 | 3.59 | 3.36 | 3.20 | 3.09 | 3.01 | 2.95 | 2.90 | 2.85 | 2.79 | 2.72 | 2.65 | 2.61 | 2.57 | 2.53 | 2.49 | 2.45 | 2.40
12 | 4.75 | 3.89 | 3.49 | 3.26 | 3.11 | 3.00 | 2.91 | 2.85 | 2.80 | 2.75 | 2.69 | 2.62 | 2.54 | 2.51 | 2.47 | 2.43 | 2.38 | 2.34 | 2.30
13 | 4.67 | 3.81 | 3.41 | 3.18 | 3.03 | 2.92 | 2.83 | 2.77 | 2.71 | 2.67 | 2.60 | 2.53 | 2.46 | 2.42 | 2.38 | 2.34 | 2.30 | 2.25 | 2.21
14 | 4.60 | 3.74 | 3.34 | 3.11 | 2.96 | 2.85 | 2.76 | 2.70 | 2.65 | 2.60 | 2.53 | 2.46 | 2.39 | 2.35 | 2.31 | 2.27 | 2.22 | 2.18 | 2.13
15 | 4.54 | 3.68 | 3.29 | 3.06 | 2.90 | 2.79 | 2.71 | 2.64 | 2.59 | 2.54 | 2.48 | 2.40 | 2.33 | 2.29 | 2.25 | 2.20 | 2.16 | 2.11 | 2.07
16 | 4.49 | 3.63 | 3.24 | 3.01 | 2.85 | 2.74 | 2.66 | 2.59 | 2.54 | 2.49 | 2.42 | 2.35 | 2.28 | 2.24 | 2.19 | 2.15 | 2.11 | 2.06 | 2.01
17 | 4.45 | 3.59 | 3.20 | 2.96 | 2.81 | 2.70 | 2.61 | 2.55 | 2.49 | 2.45 | 2.38 | 2.31 | 2.23 | 2.19 | 2.15 | 2.10 | 2.06 | 2.01 | 1.96
18 | 4.41 | 3.55 | 3.16 | 2.93 | 2.77 | 2.66 | 2.58 | 2.51 | 2.46 | 2.41 | 2.34 | 2.27 | 2.19 | 2.15 | 2.11 | 2.06 | 2.02 | 1.97 | 1.92
19 | 4.38 | 3.52 | 3.13 | 2.90 | 2.74 | 2.63 | 2.54 | 2.48 | 2.42 | 2.38 | 2.31 | 2.23 | 2.16 | 2.11 | 2.07 | 2.03 | 1.98 | 1.93 | 1.88
20 | 4.35 | 3.49 | 3.10 | 2.87 | 2.71 | 2.60 | 2.51 | 2.45 | 2.39 | 2.35 | 2.28 | 2.20 | 2.12 | 2.08 | 2.04 | 1.99 | 1.95 | 1.90 | 1.84
21 | 4.32 | 3.47 | 3.07 | 2.84 | 2.68 | 2.57 | 2.49 | 2.42 | 2.37 | 2.32 | 2.25 | 2.18 | 2.10 | 2.05 | 2.01 | 1.96 | 1.92 | 1.87 | 1.81
22 | 4.30 | 3.44 | 3.05 | 2.82 | 2.66 | 2.55 | 2.46 | 2.40 | 2.34 | 2.30 | 2.23 | 2.15 | 2.07 | 2.03 | 1.98 | 1.94 | 1.89 | 1.84 | 1.78
23 | 4.28 | 3.42 | 3.03 | 2.80 | 2.64 | 2.53 | 2.44 | 2.37 | 2.32 | 2.27 | 2.20 | 2.13 | 2.05 | 2.01 | 1.96 | 1.91 | 1.86 | 1.81 | 1.76
24 | 4.26 | 3.40 | 3.01 | 2.78 | 2.62 | 2.51 | 2.42 | 2.36 | 2.30 | 2.25 | 2.18 | 2.11 | 2.03 | 1.98 | 1.94 | 1.89 | 1.84 | 1.79 | 1.73
25 | 4.24 | 3.39 | 2.99 | 2.76 | 2.60 | 2.49 | 2.40 | 2.34 | 2.28 | 2.24 | 2.16 | 2.09 | 2.01 | 1.96 | 1.92 | 1.87 | 1.82 | 1.77 | 1.71
26 | 4.23 | 3.37 | 2.98 | 2.74 | 2.59 | 2.47 | 2.39 | 2.32 | 2.27 | 2.22 | 2.15 | 2.07 | 1.99 | 1.95 | 1.90 | 1.85 | 1.80 | 1.75 | 1.69
27 | 4.21 | 3.35 | 2.96 | 2.73 | 2.57 | 2.46 | 2.37 | 2.31 | 2.25 | 2.20 | 2.13 | 2.06 | 1.97 | 1.93 | 1.88 | 1.84 | 1.79 | 1.73 | 1.67
28 | 4.20 | 3.34 | 2.95 | 2.71 | 2.56 | 2.45 | 2.36 | 2.29 | 2.24 | 2.19 | 2.12 | 2.04 | 1.96 | 1.91 | 1.87 | 1.82 | 1.77 | 1.71 | 1.65
29 | 4.18 | 3.33 | 2.93 | 2.70 | 2.55 | 2.43 | 2.35 | 2.28 | 2.22 | 2.18 | 2.10 | 2.03 | 1.94 | 1.90 | 1.85 | 1.81 | 1.75 | 1.70 | 1.64
30 | 4.17 | 3.32 | 2.92 | 2.69 | 2.53 | 2.42 | 2.33 | 2.27 | 2.21 | 2.16 | 2.09 | 2.01 | 1.93 | 1.89 | 1.84 | 1.79 | 1.74 | 1.68 | 1.62
40 | 4.08 | 3.23 | 2.84 | 2.61 | 2.45 | 2.34 | 2.25 | 2.18 | 2.12 | 2.08 | 2.00 | 1.92 | 1.84 | 1.79 | 1.74 | 1.69 | 1.64 | 1.58 | 1.51
60 | 4.00 | 3.15 | 2.76 | 2.53 | 2.37 | 2.25 | 2.17 | 2.10 | 2.04 | 1.99 | 1.92 | 1.84 | 1.75 | 1.70 | 1.65 | 1.59 | 1.53 | 1.47 | 1.39
120 | 3.92 | 3.07 | 2.68 | 2.45 | 2.29 | 2.17 | 2.09 | 2.02 | 1.96 | 1.91 | 1.83 | 1.75 | 1.66 | 1.61 | 1.55 | 1.50 | 1.43 | 1.35 | 1.25
∞ | 3.84 | 3.00 | 2.60 | 2.37 | 2.21 | 2.10 | 2.01 | 1.94 | 1.88 | 1.83 | 1.75 | 1.67 | 1.57 | 1.52 | 1.46 | 1.39 | 1.32 | 1.22 | 1.00

$F = s_1^2 / s_2^2$, where $s_1^2 = S_1/\nu_1$ and $s_2^2 = S_2/\nu_2$ are independent mean squares estimating a common variance $\sigma^2$ and based on $\nu_1$ and $\nu_2$ degrees of freedom, respectively.

Solution 10.1:

To compare the mileage of the three car models, we will use
one-way ANOVA as discussed in the previous section and test the following hypotheses:

Null Hypothesis: The mean mileages of the three car models are equal (H0: μ1 = μ2 = μ3)
Alternate Hypothesis: The mean mileages of the three car models are not all equal (H1)

The procedure requires calculating the F ratio (F statistic), $F = \frac{MSC}{MSW}$.

If the calculated value of F is more than the critical value of F, we will reject the null
hypothesis. Otherwise, we will fail to reject the null hypothesis.
The process for calculating F is shown in the following Table 10.3 and in the working notes
that follow:
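Table 10.3: Calculations of SS, MS and F

(1) Model A $X_{1i}$ | (2) Model B $X_{2i}$ | (3) Model C $X_{3i}$ | (4) $(X_{1i}-\bar{X}_1)^2$ | (5) $(X_{2i}-\bar{X}_2)^2$ | (6) $(X_{3i}-\bar{X}_3)^2$ | (7) $(X_{1i}-\bar{X}_g)^2$ | (8) $(X_{2i}-\bar{X}_g)^2$ | (9) $(X_{3i}-\bar{X}_g)^2$
15 | 8 | 12 | 9.00 | 11.11 | 11.11 | 0.01 | 47.46 | 8.35
18 | 7 | 19 | 0.00 | 18.78 | 13.44 | 9.68 | 62.23 | 16.90
17 | 10 | 18 | 1.00 | 1.78 | 7.11 | 4.46 | 23.90 | 9.68
19 | 15 | 12 | 1.00 | 13.44 | 11.11 | 16.90 | 0.01 | 8.35
19 | 14 | 17 | 1.00 | 7.11 | 2.78 | 16.90 | 0.79 | 4.46
20 | 14 | 14 | 4.00 | 7.11 | 1.78 | 26.12 | 0.79 | 0.79
Mean = 18 | Mean = 11.33 | Mean = 15.33 | Σ = 16 | Σ = 59.33 | Σ = 47.33 | Σ = 74.07 | Σ = 135.19 | Σ = 48.52

Grand mean $\bar{X}_g$ = 14.89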

Working Notes:
(1) Calculate the mean for each sample (group): $\bar{X}_1$ = 18; $\bar{X}_2$ = 11.33; $\bar{X}_3$ = 15.33 (Columns 1,
2, 3 in the above table)
(2) Calculate the grand mean (mean of all 18 data values): $\bar{X}_g$ = 14.89
(3) Calculate the sum of the squared differences between the values in each distribution and
the sample mean of that distribution: $\sum(X_{1i}-\bar{X}_1)^2$ = 16; $\sum(X_{2i}-\bar{X}_2)^2$ = 59.33;
$\sum(X_{3i}-\bar{X}_3)^2$ = 47.33 (Columns 4, 5, 6 in the above table)
(4) Add the values calculated in point 3 to get SSW = 122.67
(5) Calculate the sum of the squared differences between the data values and the grand mean
for each distribution:
$\sum(X_{1i}-\bar{X}_g)^2$ = 74.07; $\sum(X_{2i}-\bar{X}_g)^2$ = 135.19; $\sum(X_{3i}-\bar{X}_g)^2$ = 48.52 (Columns 7, 8, 9 in the above
table)
(6) Add the values calculated in point 5 to get SST = 257.78
(7) Calculate the squared difference between each sample mean and the
grand mean, weighted by the number of observations in the sample:
$n_1(\bar{X}_1-\bar{X}_g)^2$ = 58.07; $n_2(\bar{X}_2-\bar{X}_g)^2$ = 75.85; $n_3(\bar{X}_3-\bar{X}_g)^2$ = 1.19
(8) Add the values calculated in point 7 to get SSC = 135.11
(9) Calculate the mean sums of squares:
Mean Sum of Squares among groups: $MSC = \frac{SSC}{C-1} = \frac{135.11}{3-1} = 67.56$, where C = number
of groups.
Mean Sum of Squares within groups: $MSW = \frac{SSW}{N-C} = \frac{122.67}{18-3} = 8.18$, where
N = number of observations.
(10) Calculate the F statistic: $F = \frac{MSC}{MSW} = \frac{67.56}{8.18} = 8.26$

HYPOTHESIS TESTING
The critical value of F with C-1 = 2 degrees of freedom in the numerator and N-C = 15 degrees
of freedom in the denominator for α = 0.05 in the F distribution table is $F_{critical}$ = 3.68. Since
the calculated $F_{STAT} > F_{critical}$ (8.26 > 3.68), the null hypothesis gets rejected. Hence, the mean
mileages of the three car models are not all equal, and the three samples do not all come from a
common population. The F test by itself does not identify which model differs. The sample means
$\bar{X}_1$ = 18, $\bar{X}_2$ = 11.33, $\bar{X}_3$ = 15.33 and the grand mean (overall mean) $\bar{X}_g$ = 14.89 show that
Model A has the highest mean mileage and Model B the lowest. In conclusion, the
models do not all have the same mileage, and this variation in mileage is attributed to the
difference in models.
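
The last steps of Solution 10.1 (mean sums of squares, F statistic, and decision) can be written out compactly. The sketch below is in Python and simply reuses the rounded SSC and SSW from the working notes and the tabulated critical value 3.68; the variable names are illustrative.

```python
# Values taken from the working notes of Solution 10.1
ssc = 135.11      # sum of squares column (among groups)
ssw = 122.67      # sum of squares within groups
c = 3             # number of groups (car models)
n = 18            # total number of observations

df_between = c - 1            # numerator degrees of freedom
df_within = n - c             # denominator degrees of freedom

msc = ssc / df_between        # mean sum of squares column
msw = ssw / df_within         # mean sum of squares within
f_stat = msc / msw

f_critical = 3.68             # upper 5% point of F with (2, 15) df, from the F table
reject_h0 = f_stat > f_critical

print(f"MSC = {msc:.2f}, MSW = {msw:.2f}, F = {f_stat:.2f}, reject H0: {reject_h0}")
```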

CHECK YOUR PROGRESS - II


Q.1 Consider an experiment with four groups with eight values in each. For the ANOVA
summary table below, fill in all the missing results:

Source of Variation DF SS MS F

Between-columns 80

Within-groups 560

Total

10.4 THE FACTORIAL DESIGN: TWO-WAY ANOVA

In this section, we will extend the single-factor completely randomized design (one-way
ANOVA) to a two-factor factorial design, in which we have two factors rather than one. When
there are two factors to be evaluated for making a comparison between groups, the
procedure used is called two-way ANOVA. Each of the two factors should have two or more
levels to form a factorial design in the strict sense. If any factor has only one level, then the
design takes the form of one-way ANOVA itself. The factorial design may be extended to
three or more factors. In such situations, the comparison of groups requires a
more advanced, higher-order factorial ANOVA (for example, three-way ANOVA). This should not be
confused with MANOVA (multivariate analysis of variance), which is used when there is more
than one dependent variable.

Table 10.4

City | Model A | Model B | Model C
City 1 | 15 | 8 | 12
City 2 | 18 | 7 | 19
City 3 | 17 | 10 | 18
City 4 | 19 | 15 | 12
City 5 | 19 | 14 | 17
City 6 | 20 | 14 | 14

Table 10.4 provides a situation for the use of a two-way ANOVA factorial design because there are
two factors used to categorize the data. The second factor, city (with six levels), is added to
look for variation in the mileage of cars based on the difference in cities where the cars are driven,
together with the difference in car models. A two-way ANOVA allows us to account for
variation at the row level due to a second factor, in addition to the variation due to the factor
in the columns. The levels of the second
factor are called blocks, and the second factor in rows is called the blocking variable.

Two-way ANOVA comes in two flavors: Without Replication and With Replication.

10.4.1 TWO FACTOR WITHOUT REPLICATION

In this type of design, data units are assigned randomly on two factors - one across columns
and the other across rows, without any specific sequence. It is also called a randomized
block design. Table 10.4 is an example of this type of design.

10.4.2 TWO FACTOR WITH REPLICATION

In this type, the measurements for each column (the same units in the sample) are repeated
at every level of the row factor, i.e. several units are considered for every model and
replicated across the rows, so that each combination of column and row contains more than
one observation. So, there are multiple measurements per
row. This is called with-replication. Table 10.5 provides an example of this type of factorial
design.


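Table 10.5

City | Model A | Model B | Model C
City 1 | 15 | 8 | 12
City 1 | 18 | 7 | 19
City 1 | 17 | 10 | 18
City 1 | 16 | 9 | 16
City 2 | 19 | 15 | 12
City 2 | 19 | 14 | 17
City 2 | 20 | 14 | 14
City 2 | 18 | 16 | 15
City 3 | 14 | 7 | 12
City 3 | 19 | 10 | 17
City 3 | 18 | 12 | 10
City 3 | 14 | 14 | 12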

Four cars of each of the models A, B, and C are driven in city 1 and mileage readings are
taken; then the same four cars of each model are driven in city 2 to take
mileage readings, and then again in city 3.

10.5 THE RANDOMIZED BLOCK DESIGN

As discussed above, the two-factor design without replication is called, in an experimental
context, the randomized block design. In this design, the variation in the values of a variable
is measured for the second factor along with the first factor.

10.5.1 VARIANCES OF RANDOMIZED BLOCK DESIGN


In one-way ANOVA, variance remaining after accounting for group variance is error variance.
Since there is a second factor available in two-way ANOVA, out of the entire variance some
part of variation will be due to row variance. Therefore, we attempt to minimize the error
variance by accounting for some variance due to variance in the rows. Because of this, the
differences between columns (groups) are easier to detect. So, we have four types of
variance (or variation).

10.5.1.1 TOTAL VARIATION

Three variations (between columns, between rows and error) add up to total variation.


10.5.1.2 BETWEEN-COLUMNS VARIATION

This variation is accounted for due to differences in columns or groups.

10.5.1.3 BETWEEN-BLOCKS (ROWS) VARIATION

This variation is accounted for due to difference in rows.

10.5.1.4 ERROR VARIATION

This is the remaining part of the variation, which is not due to differences in either columns or rows.
It is the unexplained variation.
Total variation = Between-columns variation + Between-blocks variation + Error variation

Total variation is represented by the sum of squares total (SST), between-columns variation
is represented by the sum of squares column (SSC), between-blocks (rows) variation is represented
by the sum of squares blocks (SSB), and error variation is represented by the sum of squares
error (SSE). Each SS has its degrees of freedom.

10.5.2 PARTITIONING SUM OF SQUARES

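SST = SSC + SSB + SSE .................... Equation 10.5

$SST = \sum_{i=1}^{B} \sum_{j=1}^{C} (X_{ij} - \bar{X}_G)^2$, with N-1 degrees of freedom, where N is the
total number of observations and $\bar{X}_G$ is the grand mean.

$SSC = \sum_{j=1}^{C} n_j (\bar{X}_{\cdot j} - \bar{X}_G)^2$, with C-1 degrees of freedom, where $\bar{X}_{\cdot j}$ is the
mean of the j-th group (column) and $n_j$ is the number of observations in the j-th group
(j = 1, 2, ..., C).

$SSB = \sum_{i=1}^{B} r_i (\bar{X}_{i \cdot} - \bar{X}_G)^2$, with B-1 degrees of freedom, where $\bar{X}_{i \cdot}$ is the
mean of the i-th block (row) and $r_i$ is the number of observations in the i-th block
(i = 1, 2, ..., B).

SSE = SST - SSC - SSB, with (C-1) x (B-1) degrees of freedom.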

10.5.3 MEAN SUM OF SQUARES

The mean sum of squares corresponding to SST is the mean sum of squares total (MST), the
mean sum of squares corresponding to SSC is the mean sum of squares column (MSC), the
mean sum of squares corresponding to SSB is the mean sum of squares blocks (MSB), and
the mean sum of squares corresponding to SSE is the mean sum of squares error (MSE).

$MST = \frac{SST}{N-1}$ .................... Equation 10.6

$MSC = \frac{SSC}{C-1}$ .................... Equation 10.7

$MSB = \frac{SSB}{B-1}$ .................... Equation 10.8

$MSE = \frac{SSE}{(B-1) \times (C-1)}$ .................... Equation 10.9

10.5.4 HYPOTHESIS TESTING


Two F ratios will be calculated by using the following equations:

$F_1 = \frac{MSC}{MSE}$ .................... Equation 10.10

$F_2 = \frac{MSB}{MSE}$ .................... Equation 10.11

F1 will follow the F distribution, with C-1 degrees of freedom in the numerator and
(C-1) x (B-1) degrees of freedom in the denominator.

F2 will also follow the F distribution, with B-1 degrees of freedom in the numerator and
(C-1) x (B-1) degrees of freedom in the denominator.
For a given level of significance, α, we reject the null hypothesis if the calculated F1 value is
greater than the upper-tail critical value, $F_\alpha$, from the F distribution with C-1 degrees of
freedom in the numerator and (C-1) x (B-1) degrees of freedom in the denominator. Thus, the
decision rule is:

The null hypothesis

$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_C$

will be rejected if $F_1 > F_\alpha$. Otherwise, do not reject $H_0$.

If the null hypothesis is false, the variation between groups (columns) will be large compared to the
error variation, making the difference between columns visible. The ratio $F_2 = \frac{MSB}{MSE}$
compares the variance between blocks (rows) with the error variance. This lets us
look at the column differences while controlling for the variance between rows.

Example 10.2: A multinational company has designed a comprehensive program for training
its executives. There are four different types of training programs. To evaluate the
effectiveness of these training programs, the company divided its executives into four groups, and
different groups were offered different training programs. After the training was over,
executives were asked to appear in the skill-aptitude-enhancement examination. Five
executives belonging to different levels of experience were randomly selected from each
training program, and their scores were recorded in the table given below.


Table 10.6

                      |      Training Programme
Experience (in years) | TP-A | TP-B | TP-C | TP-D

0-2                   | 10   | 20   | 10   | 30
2-5                   | 10   | 30   | 5    | 50
5-10                  | 20   | 30   | 10   | 20
10-15                 | 25   | 30   | 20   | 20
More than 15          | 25   | 40   | 20   | 40

The management wants to know whether the training programs are equally effective i.e.
whether they result in the same level of skill-aptitude enhancement or not. Simultaneously,
management also wants to examine whether there is some difference in the effectiveness of
programs based on the experience of employees.
What makes this problem fit for two-way ANOVA is that the executives themselves will have
their natural variation due to their different years of experience. The two-way ANOVA allows
us to account for experience variation to better determine if a difference exists among
training programs without experience variation masking any training differences. So, we can
untangle all of the sources of variation. We are interested in differences between the training
programs but since we are dealing with human beings, we have to account for natural
variation that exists among the executives themselves. We need to extract that variation
before looking at any difference that exists between training programs. And since the
sampling method involved is random without any specific sequence of selecting executives,
it is a randomized block design.

Solution 10.2:

Step 1: Calculation of Mean Values

             | TP-A | TP-B | TP-C | TP-D | Block Means

0-2          | 10   | 20   | 10   | 30   | 17.5
2-5          | 10   | 30   | 5    | 50   | 23.75
5-10         | 20   | 30   | 10   | 20   | 20
10-15        | 25   | 30   | 20   | 20   | 23.75
More than 15 | 25   | 40   | 20   | 40   | 31.25
Group Means  | 18   | 30   | 13   | 32   | Grand Mean = 23.25


Step 2: Calculation of SST

X  | (X - 23.25)² | X  | (X - 23.25)²

10 | 175.56       | 10 | 175.56
10 | 175.56       | 5  | 333.06
20 | 10.56        | 10 | 175.56
25 | 3.06         | 20 | 10.56
25 | 3.06         | 20 | 10.56
20 | 10.56        | 30 | 45.56
30 | 45.56        | 50 | 715.56
30 | 45.56        | 20 | 10.56
30 | 45.56        | 20 | 10.56
40 | 280.56       | 40 | 280.56
SST = 2563.75

Step 3: Calculation of SSC

Group Means | (Group Mean - 23.25)²


18 27.56
30 45.56
13 105.06
32 76.56
Sum = 254.75

SSC = 254.75 x B = 254.75 x 5 = 1273.75

Step 4: Calculation of SSB

Block Means | (Block Mean - 23.25)²


17.5 33.063

23.75 0.250

20 10.563

23.75 0.250

31.25 64.000

Sum = 108.125

SSB = 108.125 x C= 108.125 x 4 = 432.50

Step 5: Calculation of SSE

Since SST = SSC + SSB + SSE,

SSE = SST— SSC — SSB = 2563.75 - 1273.75 - 432.50 = 857.50
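
Steps 1 to 5 can be cross-checked with a minimal Python sketch (standard library only) that partitions the total variation for the scores of Table 10.6. The function name `partition_sums_of_squares` and the variable names are illustrative assumptions, not part of the original text.

```python
# Scores from Table 10.6: rows are experience blocks, columns are TP-A to TP-D.
scores = [
    [10, 20, 10, 30],   # 0-2 years
    [10, 30, 5, 50],    # 2-5 years
    [20, 30, 10, 20],   # 5-10 years
    [25, 30, 20, 20],   # 10-15 years
    [25, 40, 20, 40],   # more than 15 years
]


def partition_sums_of_squares(table):
    """Return (SST, SSC, SSB, SSE) for a two-way layout without replication."""
    n_blocks = len(table)        # B, number of rows
    n_columns = len(table[0])    # C, number of groups
    grand_mean = sum(sum(row) for row in table) / (n_blocks * n_columns)

    block_means = [sum(row) / n_columns for row in table]
    column_means = [
        sum(row[j] for row in table) / n_blocks for j in range(n_columns)
    ]

    sst = sum((x - grand_mean) ** 2 for row in table for x in row)
    # Each column mean rests on B observations, each block mean on C.
    ssc = n_blocks * sum((m - grand_mean) ** 2 for m in column_means)
    ssb = n_columns * sum((m - grand_mean) ** 2 for m in block_means)
    sse = sst - ssc - ssb
    return sst, ssc, ssb, sse


sst, ssc, ssb, sse = partition_sums_of_squares(scores)
print(f"SST={sst:.2f}  SSC={ssc:.2f}  SSB={ssb:.2f}  SSE={sse:.2f}")
```

The printed values should agree with the sums of squares obtained by hand in Steps 2 to 5.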

Step 6: Calculation of Mean Square Terms

   | SST     | SSC     | SSB    | SSE

SS | 2563.75 | 1273.75 | 432.50 | 857.50
df | N-1     | C-1     | B-1    | (C-1) x (B-1)
   | 20-1    | 4-1     | 5-1    | (4-1) x (5-1)
   | 19      | 3       | 4      | 12
   | MST     | MSC     | MSB    | MSE
MS | 134.93  | 424.58  | 108.13 | 71.46

Step 7: Calculation of F Ratios

$F_1 = \frac{MSC}{MSE} = \frac{424.58}{71.46} = 5.94$

$F_2 = \frac{MSB}{MSE} = \frac{108.13}{71.46} = 1.51$

Step 8: Hypothesis Testing

a) At α = 0.05, the critical value of F (according to the F distribution table) with 3 degrees of
freedom in the numerator (number of columns - 1) and 12 degrees of freedom in the
denominator ((number of columns - 1) x (number of rows - 1)) is $F_{crit}$ = 3.49.

Since $F_1 > F_{crit}$ (5.94 > 3.49), the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$ is rejected.

We reject the assumption that the mean scores of the skill-aptitude enhancement
examination across the four training programs are the same. It indicates that there is a
difference in the effectiveness of the training programs (the training programs are not all
equally effective).
b) At α = 0.05, the critical value of F (according to the F distribution table) with 4 degrees of
freedom (number of rows - 1) in the numerator and 12 degrees of freedom (as before) in the
denominator is $F_{crit}$ = 3.26.
Since $F_2 < F_{crit}$ (1.51 < 3.26), no significant variation is seen among experience categories.

It is concluded that training programs have a legitimate difference in their
effectiveness when accounting for variation in the experience of executives.
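
Steps 6 to 8 can be sketched in the same way. The snippet below starts from the sums of squares obtained in the solution and uses the table critical values quoted in Step 8 (3.49 and 3.26) rather than computing them. The helper name `f_test` is an illustrative assumption.

```python
# Sums of squares and dimensions from Solution 10.2.
ssc, ssb, sse = 1273.75, 432.50, 857.50
C, B = 4, 5                      # number of columns (programs) and blocks

df_columns = C - 1
df_blocks = B - 1
df_error = (C - 1) * (B - 1)

msc = ssc / df_columns
msb = ssb / df_blocks
mse = sse / df_error


def f_test(mean_square, critical_value):
    """Return the F ratio against MSE and whether H0 is rejected."""
    f_ratio = mean_square / mse
    return f_ratio, f_ratio > critical_value


# Critical values at alpha = 0.05 are read from the F table, as in Step 8.
f1, reject_columns = f_test(msc, 3.49)   # df = (3, 12)
f2, reject_blocks = f_test(msb, 3.26)    # df = (4, 12)

print(f"F1 = {f1:.2f}, reject H0 for training programs: {reject_columns}")
print(f"F2 = {f2:.2f}, reject H0 for experience blocks: {reject_blocks}")
```

Keeping the critical values as explicit inputs mirrors the table lookup used in the text; a statistics library could supply them instead, but that is outside this sketch.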

CHECK YOUR PROGRESS - III


The sales manager of a bathroom fittings company wants to compare the sales turnover of
the four main products manufactured by the company. The following quarterly data of sales
turnover were collected from four different markets to control variation in sales turnover due
to market characteristics:

Sales Turnover (in Rs. Crores)

        |                  Products
Markets | Faucets | Showers | Sanitaryware | Wall Hung

East    | 3       | 4.5     | 2.5          | 4
South   | 5       | 7       | 2            | 2
North   | 6       | 6       | 8            | 3
West    | 3.5     | 5.5     | 8            | 2

The sales manager wants to examine whether four products have equal sales potential.
Answer the following:

Q.1 The factors are ________ and ________.

Q.2 What is the blocking variable?

Q.3 Faucets and showers are blocks. (Yes/No)

Q.4 Calculate the values of the various sums of squares.

Q.5 Calculate the values of the various mean square terms.

Q.6 Formulate the null and alternate hypotheses.

Q.7 Test the hypothesis at α = 0.05.

Q.8 What is the answer to the sales manager's problem?

10.6 LET US SUM UP

There are countless situations where the comparison is made between groups to examine
the difference (or variation) between them. Analysis of variance (ANOVA) helps to detect
such differences between groups by comparing their means. The relative distance between
each group-mean and the overall mean of the data is calculated to visualize this subtle
difference between means. The groups may have equal or different means. If relative
differences are equal, the means are not significantly different from one another, and there
is no evidence to reject the null hypothesis. It suggests that groups (or samples) come from
the same common population with no significant variation between them. If relative
differences are not equal, the means are significantly different from one another, and there
is enough evidence to reject the null hypothesis. It suggests that groups (or samples) come
from different populations with significant variation between them.
Analysis of variance has a family with various types of designs. Designs with a single factor
are called one-way ANOVA and two or more factors are called factorial design. Designs with
two factors are called two-way ANOVA and more than two factors are called MANOVA
(multiple analysis of variance). In an experimental context, one-way ANOVA is known as a
completely randomized design. Two-factor ANOVA comes in two flavors - without replication
and with replication.
The explained source of variation in one-way ANOVA is called between-group variance
(between-column variance), and the unexplained variation (error variation) is called
within-group variance.

Since ANOVA is a ratio of variances, for more than two groups, the F distribution is used to test
the hypothesis. The F ratio in one-way ANOVA is the ratio between the mean sum of squares column
(MSC) and the mean sum of squares within (MSW). If the F value is more than the critical F
value with C-1 degrees of freedom in the numerator and N-C degrees of freedom in the
denominator, the null hypothesis is rejected. In two-way ANOVA, the F value is the ratio between the
mean sum of squares column (MSC) and the mean sum of squares error (MSE). The ratio
of the mean sum of squares block (MSB) to the mean sum of squares error (MSE) compares the
variation due to differences in blocks (rows) with the error variation. The second
variable essentially controls for variation (blocks the variation) between blocks (rows) for
identifying the variation between groups. Therefore, the row variable is also called the block
variable in the two-way ANOVA. When the null hypothesis is rejected, a further analysis of
pair-wise comparisons is required to know which group or groups differ significantly from the others.

10.7 KEYWORDS

Between-columns variation: The variation between different groups.


Blocking Variable: Variable in the row.

Blocks: Categories of the variable in the row.


Completely randomized design: One-way ANOVA in an experimental design context.
Error variation: Variation other than the explained variation.

Factor: Variable used to divide population (sample) into different groups.


Levels: Different categories in which population (sample) is categorized.

Mean sum of squares: Sum of squares divided by degrees of freedom.


One-way ANOVA: Analysis of variance with a single factor.
Randomized block design: Two-way ANOVA without replication.
Sum of squares: Numerator portion of the expression of variance.


Sum of squares blocks: The sum of squared deviations between the mean of each block
and the grand mean, weighted by the number of groups.

Sum of squares column: Sum of squared deviations between the sample mean of each
group and the grand mean, weighted by the sample size in each group.

Sum of squares total: Sum of squared deviations between each observation in the data and
the grand mean.
Sum of squares within: Sum of squared deviation between each value of the group and
mean of that group summed over all groups.

Total variation: Variation from all sources put together (sum of explained and unexplained
variation).

Two-factor with replication: Two-way ANOVA when data units corresponding to the column
are repeated over row levels such that there are multiple measurements per cell.
Two-factor without replication: Two-way ANOVA when data units are assigned without any
specific sequence.
Two-way ANOVA: Analysis of variance with two factors.

Within groups variation: The spread of the individual distribution.

10.8 REFERENCES AND SUGGESTED ADDITIONAL READINGS

Levine, David M., Stephan, David F., and Szabat, Kathryn A., 2016, Statistics for Managers
Using Microsoft Excel, 7th edition, Pearson India Education Services Pvt. Ltd., Noida.
Levine, Richard L., Rubin Davis S., Rastogi, Sanjay, and Siddiqui Masood Husain, 2013,
Statistics for Management, 7th edition, Pearson Education Inc., Noida.
Shrivastava, T.N. and Rego Shailaja., 2008, Statistics for Management, Tata McGraw-Hill,
New Delhi.

Gupta, S.P. and Gupta M.P., 2010, Business Statistics, 16th edition, Sultan Chand & Sons, New
Delhi.

10.9 SELF-ASSESSMENT QUESTIONS


Multiple Choice Questions

Q.1 Multiple analysis of variance (MANOVA) requires:


(a) More than two factors with any number of levels.

(b) One factor with many levels.

(c) Two factors with many levels.
(d) More than two factors with not more than two levels each.

Q.2 Which of the following will give the value of SST?

(a) Sum of squared deviations of pairwise differences between row and column means.

(b) Sum of the squared deviations of all group means from the overall mean of the
data, weighted by the number of observations in each group.
(c) Sum of the squared deviations of all row means from the overall mean of the data,
weighted by the number of groups.

(d) Sum of the squared deviations of all data values from the overall mean of the data.
Q.3 Blocking variables serve the purpose of:

(a) Making the variation between columns less visible.

(b) Finding the ratio between total variance and error variance.

(c) Shifting the proportion of error variance into block variance.

(d) Highlighting the variation due to blocks rather than groups.
Q.4 The randomized block design is the other name for:

(a) Multiple analysis of variance


(b) Two-way ANOVA without replication

(c) Completely randomized design


(d) Two-way ANOVA with replication

An experiment has a single factor with five groups and seven values in each group.

Q.5 How many degrees of freedom are there in determining the between-group variation?

(a) 4
(b) 5
(c) 6
(d) 7
Q.6 How many degrees of freedom are there in determining the within-group variation?
(a) 20
(b) 28
(c) 30
(d) 35
Q.7 How many degrees of freedom are there in determining the total variation?
(a) 24
(b) 28
(c) 34
(d) 35

In the above experiment having a single factor with five groups and seven values in
each group, if SSC = 60 and SST = 210:

Q.8 What is SSW?
(a) 270
(b) 150
(c) 350
(d) 285
Q.9 What is MSC?
(a) 5
(b) 10
(c) 15
(d) 20
Q.10 What is MSW?
(a) 5
(b) 10
(c) 15
(d) 20

10.10 CHECK YOUR PROGRESS — POSSIBLE ANSWERS

CHECK YOUR PROGRESS - I

Q.1 True

Q.2 (b) Ratio of variances

Q.3 Categorical or numerical


CHECK YOUR PROGRESS - II

Q.1 | Source of Variation | DF | SS  | MS    | F

    | Between-columns     | 3  | 240 | 80    | 4
    | Within-groups       | 28 | 560 | 20    |
    | Total               | 31 | 800 | 25.81 |

CHECK YOUR PROGRESS - III


Q.1 Products and Markets
Q.2 Market
Q.3 No
Q.4 SST = 65; SSC = 20.13; SSB = 11.5; SSE = 33.38
Q.5 MSC = 6.71; MSB = 3.83; MSE = 3.71

Q.6 Null hypothesis H0: μ1 = μ2 = μ3 = μ4, and alternate hypothesis H1: Not all population means
are equal

Q.7 F_cal = 1.81; F_crit = 3.86. Since F_cal < F_crit, H0 is not rejected.
Q.8 There is no evidence that the group means differ, so the samples can be treated as coming
from the same population. In other words, no significant variation in sales potential is seen
across the different products after taking into consideration the natural variation between the
markets.

10.11 ANSWERS TO SELF-ASSESSMENT QUESTIONS


Q.1 (a)
Q.2 (d)
Q.3 (c)
Q.4 (b)
Q.5 (a)
Q.6 (c)
Q.7 (c)
Q.8 (b)
Q.9 (c)
Q.10 (a)


CHI-SQUARE TEST

STRUCTURE
11.0 Objectives
11.1 Introduction
11.2 Applications of Chi-Square Test
11.3 Requirements of Chi-Square Test
11.4 Chi-Square Test for the Difference between Proportions
11.5 Chi-Square Test of Independence
11.6 Chi-Square Test of Goodness of Fit
11.7 Let Us Sum Up
11.8 Keywords
11.9 References and Suggested Additional Readings
11.10 Self-Assessment Questions
11.11 Check Your Progress - Possible Answers
11.12 Answers to Self-Assessment Questions

11.0 OBJECTIVES

After reading this unit, you will be able to understand:

• applications and situations for using the chi-square test
• the concept of expected frequencies
• hypothesis testing for the difference between two proportions using the chi-square test
• hypothesis testing for the difference between more than two proportions using
the chi-square test
• hypothesis testing for independence between two categorical variables
• testing the goodness of fit of a distribution using the chi-square test


11.1 INTRODUCTION

A super specialty hospital has four clinics: cardiology, nephrology, hepatology, and
pulmonology. A different number of patients, those who are overweight and those who are not
overweight, visit each of these clinics. The director of the hospital believes that
overweight patients are more likely to have some diseases as compared to other patients.
She thinks that being overweight is the reason for the difference in the number of patients
visiting various clinics for treatment. To examine her propositions, she collects data from a
sample of 450 patients who visited the hospital in a particular week and records the results in
the following contingency table:
Table 11.1: Number of Patients

               | Cardiology | Nephrology | Hepatology | Pulmonology | Total

Overweight     | 70         | 77         | 57         | 89          | 293
Not Overweight | 35         | 43         | 38         | 41          | 157
Total          | 105        | 120        | 95         | 130         | 450

She also has a feeling that the incidence of these diseases is related to the liquor
consumption frequency of patients. She records the data about liquor consumption
frequency and incidence of each of the diseases in the following contingency table:
Table 11.2: Number of Patients

                          | Cardiology | Nephrology | Hepatology | Pulmonology | Total

Don't consume liquor      | 25         | 10         | 7          | 8           | 50
Once a week               | 40         | 27         | 14         | 19          | 100
Two-Three times a week    | 22         | 47         | 39         | 17          | 125
Four or more times a week | 18         | 36         | 35         | 86          | 175
Total                     | 105        | 120        | 95         | 130         | 450

In addition to the above information, there is a prevailing theory in the field of medicine that
the proportions of patients who are overweight and those who are not overweight in the
overall population are equal.

The director wants to examine:

(1) Whether the proportions of patients having each type of disease are different
because of their being overweight?
(2) Whether the type of disease and the liquor consumption frequency are related?

(3) Whether the observed categorical distribution of patients' weights is the same as the
theoretical distribution of patients' weights?

In previous units, we have learned the concept of hypothesis testing (one-sample
tests, Unit 8), the procedure for comparing the means of two populations and two proportions
(two-sample tests, Unit 9), and the procedure for comparing the means of more than two
populations (analysis of variance, Unit 10). None of these procedures will
help to test the medical director's propositions about the incidence of various diseases
(measured by the number of patients visiting various disease clinics) based upon the weights
of patients, and about the relationship between disease and the liquor consumption
frequency, because we are required to test the difference among more than two
independent population proportions (cardiology, nephrology, hepatology, and pulmonology)
for a categorical variable with two levels (overweight and not overweight).
In this unit, we extend the concept of hypothesis testing to (a) analyze differences
between various population proportions (based on two or more samples); (b) test the
hypothesis of independence in the joint responses of two categorical variables; and (c)
test the goodness of fit of a distribution. We will learn to use the chi-square test to test the above
hypotheses and to examine the three propositions made by the medical director (in the
above-mentioned case of a super-specialty hospital). The hypothesis testing procedure uses a test
statistic that is approximated by a chi-square (χ²) distribution.

11.2 APPLICATIONS OF CHI-SQUARE TEST

As discussed in the previous section, the chi-square test can be used in all situations where we
want to examine the difference between various population proportions for a categorical
variable of interest; the relationship between two categorical variables, each with many
levels; and the appropriateness of a distribution. An illustrative list of applications of the
chi-square test is given below:

• Examination of the difference between various types of internet users in the
proportion that dislikes online ads.
• Comparison of the difference between hotels in the proportion of guests who are
likely to return.
• To check the relationship between the primary reason for buying online and
customer satisfaction across various e-commerce websites.
• To check the uniformity in the number of employees who leave an organization
every year.

CHECK YOUR PROGRESS - I

Q.1 Can we use the chi-square test to see whether the type of funding received by start-ups
(classified as crowd funding, angel-investor funding, or bank finance) is
independent of the stage at which funding is received (classified as emergence,
survival, growth, or sustenance)? (Yes/No)

Q.2 Can we use the chi-square test to determine whether the average summer
temperatures are different across northern, southern, eastern, and western parts of
India? (Yes/No)

Q.3 Results of hypothesis tests of difference between two samples by using the Z test and
the χ² test will be different. (True/False)

11.3 REQUIREMENTS OF CHI-SQUARE TEST


11.3.1 CATEGORICAL VARIABLES

The variables involved are of the categorical type. The populations whose proportions are
compared are the different categories or levels of one variable. In Table 11.1, the type of disease,
with four different levels, is a categorical variable. The second variable, containing the items of
interest (patients who are overweight in Table 11.1), is also categorical (with two levels).

11.3.2 CONTINGENCY TABLE

The count of items under the various categories of the variables of interest is recorded in either
2 x c or r x c contingency tables. Table 11.1 is a 2 x 4 contingency table in which the row
totals and the column totals each add up to 450.

11.3.3 FREQUENCY

The count of items is of particular importance in chi-square analysis because it requires a
comparison between observed and expected frequencies. The chi-square test checks whether
the difference between observed and expected frequencies is significant or not.

Observed Frequencies (fo): These are the frequencies that are observed in a sample (or
collected data). In other words, these are the counts of a sample distribution. Counts in Table
11.1 are observed frequencies.

Expected Frequencies (fe): These are the frequencies or counts that are expected in a
theoretical distribution. Expected frequencies can be known based on some theory or can be
calculated as proportions from the sample data. For example, a theoretical distribution of
500 tosses of a fair coin will contain 250 heads and 250 tails; similarly, a theoretical distribution
of results of a fair dice experiment will contain an equal number of each outcome: 1, 2, 3, 4, 5 and 6.
These counts are expected frequencies. The expected frequencies for data in Table 11.1
can be calculated as proportions of the total sample size.

The expected frequency (fe) of a cell is calculated as follows:

$f_e = \frac{(\text{corresponding column total}) \times (\text{corresponding row total})}{\text{total sample size}}$

Note: Each fe value should be at least 5 for a chi-square test to give accurate results.
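
The expected-frequency rule can be illustrated with a short Python sketch applied to the observed counts of Table 11.1. The function name `expected_frequencies` is an assumption introduced for illustration only.

```python
# Observed counts from Table 11.1 (rows: overweight, not overweight).
observed = [
    [70, 77, 57, 89],
    [35, 43, 38, 41],
]


def expected_frequencies(table):
    """Expected count of each cell = row total * column total / sample size."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in column_totals] for r in row_totals]


expected = expected_frequencies(observed)
for row in expected:
    print([round(value, 2) for value in row])

# Rule of thumb from the note above: every expected frequency
# should be at least 5 for the chi-square approximation to hold.
all_large_enough = all(value >= 5 for row in expected for value in row)
print("All expected frequencies >= 5:", all_large_enough)
```

Note that each row and each column of the expected table keeps the same total as the observed table, which is a quick way to check the arithmetic by hand.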

CHECK YOUR PROGRESS - II


Q.1 Compute the expected frequencies for the following contingency table.

Table 11.3

      | X  | Y   | Total

I     | 10 | 50  | 60
II    | 30 | 110 | 140
Total | 40 | 160 | 200

11.4 CHI-SQUARE TEST FOR THE DIFFERENCE BETWEEN PROPORTIONS


The chi-square test for the difference between proportions (or equality of proportions) has a
factor of interest with two or more levels. These levels represent samples drawn from
independent populations. For example, in Table 11.1, cardiology, nephrology, hepatology,
and pulmonology are four levels of the disease type. The categorical responses in each group or
level are classified into two categories: an item of interest and not an item of interest
(overweight and not overweight patients, as in Table 11.1).
When there are two levels of the factor of interest, the chi-square test is done for testing the
difference between two proportions. When there are more than two levels of the factor of
interest, the chi-square test is done for testing the difference between more than two
proportions.

11.4.1 CHI-SQUARE TEST FOR THE DIFFERENCE BETWEEN TWO PROPORTIONS


In Unit 9, you have studied the Z-test for the difference between two proportions. In this section,
the data are examined from a different perspective. The hypothesis testing procedure uses a
test statistic that is approximated by a chi-square (χ²) distribution. The results and
conclusions of χ² tests are equivalent to those of Z-tests.
To compare the counts of categorical responses between two independent groups
(proportions), a 2 x 2 contingency table displays the frequency of occurrence of items of
interest and items not of interest for each group. The following contingency table displays
the situation for testing the difference between two proportions:


Table 11.4
(Preference of Mobile Phones for Online Shopping by Employees)

                                       |               Profession
                                       | Corporate Employees | Government Employees | Total

Prefer Mobile Phone for Shopping       | 166                 | 154                  | 320
Don't Prefer Mobile Phone for Shopping | 68                  | 112                  | 180
Total                                  | 234                 | 266                  | 500

Hypothesis Testing

Null Hypothesis $H_0: \pi_1 = \pi_2$ states that the two population proportions are equal, and Alternative
Hypothesis $H_1: \pi_1 \neq \pi_2$ states that the two population proportions are not equal.

The χ² test statistic is equal to the squared difference between the observed and expected
frequencies, divided by the expected frequency in each cell of the table, summed over all
cells of the table.

It is calculated by using the following expression:

$\chi^2 = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e}$

We reject the null hypothesis if the calculated value of χ² is greater than the critical (table) value
of the χ² distribution with 1 degree of freedom at the chosen level of significance. A chi-square
test with r rows and c columns has (r-1) x (c-1) degrees of freedom. Since the test of difference
between two populations has r = 2 and c = 2,
it has (2-1) x (2-1) = 1 x 1 = 1 degree of freedom. Rejecting the null hypothesis will mean that
the two proportions are different for the categorical variable of interest.
If the calculated value of the χ² statistic is less than or equal to the critical value of the χ²
distribution with 1 degree of freedom, we fail to reject the null hypothesis and conclude that there
is no evidence that the two proportions differ for the categorical variable of interest. It means that
the sample proportions that we
compute for each of the groups would differ from each other only by chance. Each sample
would then provide an estimate of the common population parameter, π. A statistic that
combines these two separate estimates into one overall estimate of the population
parameter provides more information than either of the two estimates could provide by
itself.

Example 11.1: Sample data from 500 employees belonging to two different types of
professions (corporate and government) was collected to know whether these two groups of
employees were different in terms of their preference for use of mobile phones for online
shopping. Formulate the appropriate hypothesis, test it at a 5% level of significance, and
determine whether the two groups are different.


Table 11.5: Preference of Mobile Phones for Online Shopping by Employees

                                              |               Profession
                                              | Corporate Employees | Government Employees | Total

Prefer Mobile Phone for Online Shopping       | 166                 | 154                  | 320
Don't Prefer Mobile Phone for Online Shopping | 68                  | 112                  | 180
Total                                         | 234                 | 266                  | 500

Solution: Since the objective is to determine whether two groups are different from each
other about their preference for using mobile phones for doing online shopping, the
condition entails the use of chi-square for testing the difference between two proportions
(corporate employees and government employees).

Step 1: Hypothesis Formulation


If $\pi_1$ and $\pi_2$ represent the population proportions of corporate and government employees,
respectively, the corresponding hypotheses are:
Null Hypothesis $H_0: \pi_1 = \pi_2$ states that the two population proportions are equal (not different),
and

Alternative Hypothesis $H_1: \pi_1 \neq \pi_2$ states that the two population proportions are not equal
(different).

Step 2: Observed Frequencies (fo)

Table 11.6

                                              |               Profession
                                              | Corporate Employees | Government Employees | Total

Prefer Mobile Phone for Online Shopping       | 166                 | 154                  | 320
Don't Prefer Mobile Phone for Online Shopping | 68                  | 112                  | 180
Total                                         | 234                 | 266                  | 500

Step 3: Calculation of Expected Frequencies (fe)

The expected frequency (fe) of each cell is calculated as follows:

$f_e = \frac{(\text{corresponding column total}) \times (\text{corresponding row total})}{\text{total sample size}}$


Table 11.7

                                              |               Profession
                                              | Corporate Employees | Government Employees | Total

Prefer Mobile Phone for Online Shopping       | (234 x 320)/500     | (266 x 320)/500      | 320
Don't Prefer Mobile Phone for Online Shopping | (234 x 180)/500     | (266 x 180)/500      | 180
Total                                         | 234                 | 266                  | 500

Table 11.8: Expected Frequencies (fe)

                                              |               Profession
                                              | Corporate Employees | Government Employees | Total

Prefer Mobile Phone for Online Shopping       | 149.76              | 170.24               | 320
Don't Prefer Mobile Phone for Online Shopping | 84.24               | 95.76                | 180
Total                                         | 234                 | 266                  | 500

Step 4: Calculation of χ² Value

Table 11.9

Cell No. | fo  | fe     | (fo - fe)² | (fo - fe)²/fe

R1C1     | 166 | 149.76 | 263.74     | 1.76
R1C2     | 154 | 170.24 | 263.74     | 1.55
R2C1     | 68  | 84.24  | 263.74     | 3.13
R2C2     | 112 | 95.76  | 263.74     | 2.75

Σ (fo - fe)²/fe = 9.20

$\chi^2 = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e} = 9.20$

The table value of χ² with 1 degree of freedom at α = 0.05 is 3.841.

Since the calculated value of the χ² statistic is greater than the table value of χ² (9.20 > 3.841),
we reject the null hypothesis
that there is no difference in preference for the mobile phone for online shopping between
the two groups of employees. This means the two proportions (groups of employees) are
significantly different. We conclude that the proportion of corporate employees who prefer
the mobile phone for online shopping is different from the proportion of government
employees who prefer mobile phones for online shopping.
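
The calculation in Example 11.1 can be reproduced with a few lines of Python. The sketch also checks the equivalence with the Z-test noted at the start of this section: for a 2 x 2 table, the square of the pooled two-proportion Z statistic equals the χ² statistic. Function and variable names are illustrative assumptions, and the critical value 3.841 is the table value quoted in the example rather than something the code computes.

```python
import math

# Observed counts from Example 11.1
# (rows: prefer, don't prefer; columns: corporate, government).
observed = [
    [166, 154],
    [68, 112],
]


def chi_square_statistic(table):
    """Sum of (fo - fe)^2 / fe over all cells of a contingency table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * column_totals[j] / n
            statistic += (fo - fe) ** 2 / fe
    return statistic


chi2 = chi_square_statistic(observed)

# Pooled two-proportion Z statistic for the same data.
n1, n2 = 234, 266
p1, p2 = 166 / n1, 154 / n2
p_pooled = (166 + 154) / (n1 + n2)
z = (p1 - p2) / math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

print(f"chi-square = {chi2:.2f}, Z squared = {z ** 2:.2f}")
print("Reject H0 at alpha = 0.05:", chi2 > 3.841)
```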

11.4.2 CHI-SQUARE TEST FOR DIFFERENCE AMONG MORE THAN TWO PROPORTIONS

In this section, we extend the chi-square test to compare more than two independent
populations. Instead of two proportions, we have several proportions. We use the
letter c to represent the number of independent proportions. Thus, the contingency table
will have 2 rows and c columns.

Hypothesis Testing

Null Hypothesis $H_0: \pi_1 = \pi_2 = \dots = \pi_c$ states that there are no differences among the c
population proportions, and
Alternative Hypothesis $H_1$ states that not all the c population proportions are equal.
The number of degrees of freedom will be (2-1) x (c-1) = c-1.
The rest of the procedure for calculating the values of expected frequencies and the χ² statistic, and the
decision criteria for rejecting or not rejecting the null hypothesis, are the same as in the case of
two proportions.
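
For a 2 x c table the same computation applies, with only the degrees of freedom changing. The sketch below, using the counts of Table 11.1 and a hypothetical helper named `chi_square_test`, shows one way to organise it. The critical value 7.815 (χ² table, 3 degrees of freedom, α = 0.05) is entered by hand rather than computed.

```python
def chi_square_test(table, critical_value):
    """Return (statistic, degrees of freedom, reject H0?) for an r x c table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected count of a cell is row total * column total / n.
    statistic = sum(
        (fo - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for fo, c in zip(row, column_totals)
    )
    degrees_of_freedom = (len(row_totals) - 1) * (len(column_totals) - 1)
    return statistic, degrees_of_freedom, statistic > critical_value


# Observed counts from Table 11.1 (overweight / not overweight by clinic).
patients = [
    [70, 77, 57, 89],
    [35, 43, 38, 41],
]

# 7.815 is the chi-square table value for 3 degrees of freedom at alpha = 0.05.
statistic, df, reject = chi_square_test(patients, critical_value=7.815)
print(f"chi-square = {statistic:.2f} with {df} degrees of freedom; reject H0: {reject}")
```

Because the function works from row and column totals, it handles any r x c table, provided the caller supplies the critical value for the matching degrees of freedom.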
Example 11.2: A super-specialty hospital has four clinics: cardiology, nephrology, hepatology, and pulmonology. Patients who are overweight and patients who are not overweight visit each of these clinics in different numbers. The director of the hospital wants to examine whether being overweight is the reason for the difference in the number of patients visiting the various clinics for treatment. She collects data from a sample of 450 patients who visited the hospital in a particular week and records the results in the following (2 × 4) contingency table:

Table 11.10: Number of Patients

Weight Status | Cardiology | Nephrology | Hepatology | Pulmonology | Total
Overweight | 70 | 77 | 57 | 89 | 293
Not Overweight | 35 | 43 | 38 | 41 | 157
Total | 105 | 120 | 95 | 130 | 450

Test the hypothesis at a 5% level of significance.


Solution:
Step 1: Hypothesis Formulation
If $\pi_1$, $\pi_2$, $\pi_3$, and $\pi_4$ represent the population proportions of overweight patients among those who visit the cardiology, nephrology, hepatology, and pulmonology clinics respectively, the corresponding hypotheses are:

Null Hypothesis H₀: $\pi_1 = \pi_2 = \pi_3 = \pi_4$ states that the proportion of overweight patients is the same in each of the four clinics, and

Alternative Hypothesis H₁: states that not all four proportions are equal.

Step 2: Observed Frequencies (fo)

Table 11.11

Weight Status | Cardiology | Nephrology | Hepatology | Pulmonology | Total
Overweight | 70 | 77 | 57 | 89 | 293
Not Overweight | 35 | 43 | 38 | 41 | 157
Total | 105 | 120 | 95 | 130 | 450

Step 3: Calculation of Expected Frequencies (fe)

Table 11.12

Weight Status | Cardiology | Nephrology | Hepatology | Pulmonology | Total
Overweight | (105 × 293)/450 | (120 × 293)/450 | (95 × 293)/450 | (130 × 293)/450 | 293
Not Overweight | (105 × 157)/450 | (120 × 157)/450 | (95 × 157)/450 | (130 × 157)/450 | 157
Total | 105 | 120 | 95 | 130 | 450

Table 11.13: Expected Frequencies (fe)

Weight Status | Cardiology | Nephrology | Hepatology | Pulmonology | Total
Overweight | 68.37 | 78.13 | 61.86 | 84.64 | 293
Not Overweight | 36.63 | 41.87 | 33.14 | 45.36 | 157
Total | 105 | 120 | 95 | 130 | 450

Step 4: Calculation of χ² Value

Table 11.14

Cell No. | fo | fe | (fo - fe)²/fe
R1C1 | 70 | 68.37 | 0.04
R1C2 | 77 | 78.13 | 0.02
R1C3 | 57 | 61.86 | 0.38
R1C4 | 89 | 84.64 | 0.22
R2C1 | 35 | 36.63 | 0.07
R2C2 | 43 | 41.87 | 0.03
R2C3 | 38 | 33.14 | 0.71
R2C4 | 41 | 45.36 | 0.42
Σ (fo - fe)²/fe = 1.89

χ² = 1.89
The table value of χ² with (2 - 1) × (4 - 1) = 3 degrees of freedom at α = 0.05 is 7.815 (from the table).

Since the calculated χ² value (1.89) is less than the table χ² value (7.815), we fail to reject the null hypothesis. Hence, we do not have sufficient evidence to say that the population proportions are different from each other. The sample gives no evidence that the proportion of overweight patients differs among the four clinics, or that patients who are overweight have a higher incidence of any one of the four types of disease (related to heart, kidney, liver, and lungs) than of the others. The difference in the incidence of diseases (measured by the number of patients) is consistent with random or chance variation and cannot be attributed to the weight of patients.
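
To see what the null hypothesis claims in this example, it can help to look at the sample proportions directly. The sketch below uses the counts of Table 11.10. It relies on a shortcut that is not part of the text above: for a table with two rows, the cell-by-cell statistic is algebraically equal to the sum of n_j × (p_j - pooled)² divided by pooled × (1 - pooled), where p_j is the sample proportion in column j. The variable names are illustrative.

```python
# Observed counts from Table 11.10.
clinics = ["Cardiology", "Nephrology", "Hepatology", "Pulmonology"]
overweight = [70, 77, 57, 89]
clinic_totals = [105, 120, 95, 130]

# Sample proportion of overweight patients in each clinic.
proportions = [x / n for x, n in zip(overweight, clinic_totals)]

# Pooled proportion: the common value that H0 assumes for every clinic.
pooled = sum(overweight) / sum(clinic_totals)

# For a 2 x c table the cell-by-cell statistic can equivalently be written as
# the sum of n_j * (p_j - pooled)^2, divided by pooled * (1 - pooled).
chi2 = sum(n * (p - pooled) ** 2
           for n, p in zip(clinic_totals, proportions)) / (pooled * (1 - pooled))

for name, p in zip(clinics, proportions):
    print(f"{name}: {p:.3f}")
print(f"pooled: {pooled:.3f}, chi-square: {chi2:.2f}")
```

The four sample proportions all lie close to the pooled proportion, which is why the statistic stays well below the table value.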

Here the chi-square test has in effect been used as a test of the relationship between two categorical variables, and the sample shows no evidence of a relationship between the type of disease and the weight of patients. In other words, the data are consistent with these two categorical variables being independent. The following section covers the applicability of chi-square as a test of independence.

11.5 CHI-SQUARE TEST OF INDEPENDENCE

Chi-square is also used as a test of independence between two categorical variables. In a test of the difference between proportions, there is one factor of interest with two or more levels (representing independent proportions) and a categorical variable with two levels (representing items of interest and items not of interest). In a test of independence, the second categorical variable can also have more than two levels. Thus, a test of independence has two factors of interest, each with two or more levels. The contingency table has r rows and c columns.

Hypothesis Testing

Null Hypothesis H₀: The categorical variables are independent, and

Alternative Hypothesis H₁: The categorical variables are dependent.

The number of degrees of freedom will be (r - 1) × (c - 1).

The rest of the procedure (calculating the expected frequencies and the χ² statistic, and the decision criteria for rejecting or not rejecting the null hypothesis) is the same as in the case of a test of the difference between two or more proportions.
Example 11.3: The director of a super-specialty hospital that has four clinics (cardiology, nephrology, hepatology, and pulmonology) wants to examine whether the incidence of these diseases is related to the liquor consumption frequency of patients. She records the data on liquor consumption frequency and the incidence of each of the diseases in the following contingency table:

Table 11.15: Number of Patients

Liquor Consumption | Cardiology | Nephrology | Hepatology | Pulmonology | Total
Don't consume liquor | 25 | 10 | 7 | 8 | 50
Once a week | 40 | 27 | 14 | 19 | 100
Two-Three times a week | 22 | 47 | 39 | 17 | 125
Four or more times a week | 18 | 36 | 35 | 86 | 175
Total | 105 | 120 | 95 | 130 | 450

Test the hypothesis at a 5% level of significance.

Solution:
Step 1: Hypothesis Formulation

Null Hypothesis H₀: There is no relationship between the type of disease and the liquor consumption frequency of patients; that is, the type of disease is independent of the liquor consumption frequency of patients.

Alternative Hypothesis H₁: There is a relationship between the type of disease and the liquor consumption frequency of patients; that is, the type of disease depends on the liquor consumption frequency of patients.

The number of degrees of freedom will be (r - 1) × (c - 1) = (4 - 1) × (4 - 1) = 9.

Step 2: Observed Frequencies (fo)

Liquor Consumption | Cardiology | Nephrology | Hepatology | Pulmonology | Total
Don't consume liquor | 25 | 10 | 7 | 8 | 50
Once a week | 40 | 27 | 14 | 19 | 100
Two-Three times a week | 22 | 47 | 39 | 17 | 125
Four or more times a week | 18 | 36 | 35 | 86 | 175
Total | 105 | 120 | 95 | 130 | 450

Step 3: Calculation of Expected Frequencies (fe)

Liquor Consumption | Cardiology | Nephrology | Hepatology | Pulmonology | Total
Don't consume liquor | 11.67 | 13.33 | 10.56 | 14.44 | 50
Once a week | 23.33 | 26.67 | 21.11 | 28.89 | 100
Two-Three times a week | 29.17 | 33.33 | 26.39 | 36.11 | 125
Four or more times a week | 40.83 | 46.67 | 36.94 | 50.56 | 175
Total | 105 | 120 | 95 | 130 | 450

Step 4: Calculation of χ² Value

fo | fe | (fo - fe)²/fe
25 | 11.67 | 15.24
40 | 23.33 | 11.90
22 | 29.17 | 1.76
18 | 40.83 | 12.77
10 | 13.33 | 0.83
27 | 26.67 | 0.00
47 | 33.33 | 5.60
36 | 46.67 | 2.44
7 | 10.56 | 1.20
14 | 21.11 | 2.40
39 | 26.39 | 6.03
35 | 36.94 | 0.10
8 | 14.44 | 2.88
19 | 28.89 | 3.39
17 | 36.11 | 10.11
86 | 50.56 | 24.85
Σ (fo - fe)²/fe = 101.50

χ² = 101.50

The table value of χ² with (4 - 1) × (4 - 1) = 9 degrees of freedom at α = 0.05 is 16.919.
Since the calculated χ² value (101.50) is greater than the table χ² value (16.919), we reject the null hypothesis (that the type of disease is independent of the liquor consumption frequency of patients). We conclude that there is a relationship between the two categorical variables, the type of disease and the liquor consumption frequency of patients; that is, the type of disease depends on the liquor consumption frequency of patients. The variation in the number of patients across the different types of diseases is not just due to chance but is associated with liquor consumption.
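
The calculation for this r × c table can be checked with the sketch below, which also reports which cell contributes most to the statistic. It is an illustration only: the counts are those of Table 11.15, the variable names are hypothetical, and the critical value is the table value quoted in the text.

```python
# Observed frequencies from Table 11.15 (rows: liquor consumption frequency;
# columns: cardiology, nephrology, hepatology, pulmonology).
observed = [
    [25, 10, 7, 8],     # don't consume liquor
    [40, 27, 14, 19],   # once a week
    [22, 47, 39, 17],   # two-three times a week
    [18, 36, 35, 86],   # four or more times a week
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Contribution of each cell, keyed by (row, column) numbered from 1.
contributions = {}
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / n   # expected frequency of the cell
        contributions[(i + 1, j + 1)] = (fo - fe) ** 2 / fe

chi2 = sum(contributions.values())
df = (len(observed) - 1) * (len(observed[0]) - 1)
largest_cell = max(contributions, key=contributions.get)

CRITICAL_VALUE = 16.919  # table value quoted in the text for df = 9, alpha = 0.05
print(f"chi-square = {chi2:.2f}, df = {df}, reject H0: {chi2 > CRITICAL_VALUE}")
print(f"largest contribution: cell R{largest_cell[0]}C{largest_cell[1]}")
```

Looking at the individual contributions is a simple way to see where observed and expected frequencies disagree most; it does not replace a formal follow-up procedure.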

With the chi-square test of the difference between proportions, we can only conclude that
proportions are not equal but cannot conclude which proportions differ. To determine which
proportions differ, a multiple comparisons procedure like the Marascuilo procedure, which is
out of scope in the present discussion, can be used.

CHECK YOUR PROGRESS - III

The following contingency table provides the results of a survey on the primary reason for shopping online among customers of three popular e-commerce platforms:
Table 11.16: Primary Reason for Shopping Online

Primary Reason | Amazon | Flipkart | Snapdeal | Total
Discount | 100 | 70 | 30 | 200
Product Variety | 75 | 50 | 25 | 150
Quick Delivery | 65 | 20 | 15 | 100
Good Return Policy | 10 | 10 | 30 | 50
Total | 250 | 150 | 100 | 500

Q.1 Calculate the number of degrees of freedom.

Q.2 Can we test the hypothesis of difference among the proportions of customers of the three e-commerce platforms based on their primary reason for shopping by using the chi-square test? Give the reason in support of your argument.

Q.3 Can we test the hypothesis of a relationship between the primary reason for shopping and the preferred e-commerce platform?

11.6 CHI-SQUARE TEST OF GOODNESS OF FIT

In addition to testing hypotheses about the difference in population proportions and about independence between two categorical variables, chi-square can also be used to decide whether a particular distribution is appropriate or not. Testing the appropriateness of a distribution is an important capability of chi-square, since decision-making involves choosing a probability distribution (normal, Poisson, binomial, etc.) to represent the data collected. The chi-square test allows us to test a theoretical categorical distribution against an observed categorical distribution, that is, to examine whether there is a significant difference between a theoretical frequency distribution and an observed frequency distribution. Thus, we can determine the goodness of fit of a theoretical distribution (how well it fits the distribution of the data that we have observed).

Suppose the medical director of the super-specialty hospital wants to examine whether the prevailing theory (that the proportions of patients who are overweight and who are not overweight are equal) holds true when compared with the observed distribution of the weight of patients. If the observed frequencies are close enough to the expected frequencies, we can conclude that there is no significant difference between the theoretical and observed distributions. Otherwise, we may doubt the prevailing theory that the probability of both types of patients (overweight and not overweight) is the same. We use the chi-square procedure discussed in previous sections to decide whether or not to reject our null hypothesis.
Distribution of Weight of Patients

Weight Status | Number of Patients
Overweight | 293
Not Overweight | 157
Total | 450

Hypothesis

If $\pi_1$ and $\pi_2$ represent the population proportions of overweight patients and not-overweight patients respectively, the corresponding hypotheses are:
Null Hypothesis H₀: $\pi_1 = \pi_2$ states that the two population proportions are equal, and

Alternative Hypothesis H₁: $\pi_1 \neq \pi_2$ states that the two population proportions are not equal.
We will test the above-stated hypotheses at a 5% level of significance.

Number of degrees of freedom = Number of categories - 1 = 2 - 1 = 1

Calculation of Chi-square Value

Weight Status | fo | fe | (fo - fe)²/fe
Overweight | 293 | 225 | 20.55
Not Overweight | 157 | 225 | 20.55
Total | 450 | 450 | Σ (fo - fe)²/fe = 41.10

χ² = 41.102
The table value of χ² with 1 degree of freedom at α = 0.05 is 3.841.

Since the calculated χ² value (41.102) is greater than the table χ² value (3.841), we reject the theoretical proposition that the two types of patients (overweight and not overweight) are equal in number. The present sample does not provide any evidence in support of the theory.
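
The arithmetic behind this result can be written out as follows, assuming (as the theory states) that each of the two categories is expected to hold half of the 450 patients:

```latex
f_e = \frac{450}{2} = 225, \qquad
\chi^2 = \frac{(293-225)^2}{225} + \frac{(157-225)^2}{225}
       = \frac{4624}{225} + \frac{4624}{225} \approx 20.55 + 20.55 = 41.10
```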
Example 11.4: A salesman has five accounts to visit per day. It is assumed that the number of sales per day follows a binomial distribution, with the probability of a sale at each account being 0.4. Given the following distribution of sales per day, can it be concluded at a 5% level of significance that the sales data follow a binomial distribution?
Table 11.6

Number of Sales Per Day | 0 | 1 | 2 | 3 | 4 | 5
Frequency of Number of Sales | 10 | 41 | 60 | 20 | 6 | 3

Solution:
Step 1: Hypothesis Formulation

Null Hypothesis H₀: A binomial distribution with a probability of 0.4 is a good description of sales, and
Alternative Hypothesis H₁: A binomial distribution with a probability of 0.4 is not a good description of sales.
Degrees of freedom = Number of categories - 1 = 6 - 1 = 5
Significance Level: 0.05

Step 2: Calculation of Binomial Frequencies (Expected Frequencies, fe)

$$P(X = x) = \frac{n!}{x!\,(n-x)!}\, p^{x} (1-p)^{n-x}$$

where p = probability of a sale = 0.4, n = 5, x = the number of sales per day, and N = total number of observations = 10 + 41 + 60 + 20 + 6 + 3 = 140.

x | fo | P(X) | fe = N × P(X)
0 | 10 | 0.07776 | 10.8864
1 | 41 | 0.2592 | 36.288
2 | 60 | 0.3456 | 48.384
3 | 20 | 0.2304 | 32.256
4 | 6 | 0.0768 | 10.752
5 | 3 | 0.01024 | 1.4336
Total | N = Σfo = 140 | 1 | Σfe = 140

Step 3: Calculation of χ² Value

fo | fe | (fo - fe)²/fe
10 | 10.8864 | 0.072173
41 | 36.288 | 0.611854
60 | 48.384 | 2.788762
20 | 32.256 | 4.656794
6 | 10.752 | 2.100214
3 | 1.4336 | 1.711502
Σ (fo - fe)²/fe = 11.9413

χ² = 11.9413
The table value of χ² with 5 degrees of freedom at α = 0.05 is 11.071.

Since the calculated χ² value (11.9413) is greater than the table χ² value (11.071), the observed frequencies are too far from the expected frequencies to conclude that the variable, sales, follows a binomial distribution. Therefore, we reject the null hypothesis.
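
Steps 2 and 3 of this example can be reproduced with the following sketch. It assumes the observed frequencies used in the calculation tables (10, 41, 60, 20, 6, 3), uses Python's standard `math.comb` for the binomial coefficient, and takes the critical value from the text; the variable names are illustrative.

```python
from math import comb

n_trials, p = 5, 0.4                 # accounts visited per day, P(sale) from the example
observed = [10, 41, 60, 20, 6, 3]    # days with 0, 1, ..., 5 sales
n_days = sum(observed)               # N = 140

# Binomial probability of x sales in n_trials visits.
probabilities = [comb(n_trials, x) * p ** x * (1 - p) ** (n_trials - x)
                 for x in range(n_trials + 1)]
expected = [n_days * prob for prob in probabilities]

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
df = len(observed) - 1

CRITICAL_VALUE = 11.071  # table value quoted in the text for df = 5, alpha = 0.05
print([round(fe, 4) for fe in expected])
print(f"chi-square = {chi2:.4f}, df = {df}, reject H0: {chi2 > CRITICAL_VALUE}")
```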

CHECK YOUR PROGRESS -IV

Q.1 From a sample survey of examination results of 500 students, it was found that 220
had failed, 170 had secured third class, 90 were placed in second class and 20 got first
class. Are these figures commensurate with the general examination result which is in
the ratio 4:3:2:1 for the various categories, respectively at a 5% level of significance?

11.7 LET US SUM UP

When there are more than two populations (or proportions), hypothesis testing for the difference between proportions is not possible by the procedures learned in previous units. To test hypotheses in such situations, as well as to test hypotheses of independence (or relationship) between two categorical variables and to test the goodness of fit of a distribution, we use the chi-square test. The hypothesis testing procedure uses a test statistic that is approximated by a chi-square (χ²) distribution.
To conduct a chi-square test, sample data consisting of observed frequencies of categorical variables are recorded in a contingency table, and expected frequencies are calculated based on the proportions of items of interest. A chi-square statistic then compares the observed frequencies with the expected frequencies to check whether the two sets of frequencies are far enough apart to reject the null hypothesis. If there is no significant difference between them, the null hypothesis of no difference between population proportions is not rejected; otherwise, it is rejected. This requires comparing the calculated χ² value with the critical χ² value obtained from a table for the given degrees of freedom and level of significance. If the calculated χ² value is less than or equal to the critical χ² value, we fail to reject the null hypothesis; otherwise, we reject it.

The same procedure is followed to test the difference between two proportions or among more than two proportions, to test the relationship between two categorical variables with any number of levels, and to test the goodness of fit of a theoretical distribution. In this way, we can decide how far our assumptions about a distribution hold before we make decisions based on such assumptions.

11.8 KEYWORDS

Categorical Variable: A variable for which responses are recorded in categories.

Contingency Table: A table used to record joint responses of categorical variables such that the totals of row-counts and column-counts are equal.

Degrees of Freedom: Number of values one can freely choose.

Observed Distribution: Distribution of data values of a sample.


Expected Frequencies: Frequencies or counts expected in a theoretical distribution.

Factor: Variable used to divide the population into different groups.


Levels: Different categories in which the population is divided.

Observed Frequencies: Frequencies or counts in a sample distribution.


Proportions: Parts (or samples) of same or different populations.

Theoretical Distribution: Distribution of data values that follow certain assumptions.

11.9 REFERENCES AND SUGGESTED ADDITIONAL READINGS

Levine, David M., Stephan, David F., and Szabat, Kathryn A., 2016, Statistics for Managers Using Microsoft Excel, 7th edition, Pearson India Education Services Pvt. Ltd., Noida.

Levine, Richard L., Rubin, Davis S., Rastogi, Sanjay, and Siddiqui, Masood Husain, 2013, Statistics for Management, 7th edition, Pearson Education Inc., Noida.

Shrivastava, T.N. and Rego, Shailaja, 2008, Statistics for Management, Tata McGraw-Hill, New Delhi.

Gupta, S.P. and Gupta, M.P., 2010, Business Statistics, 16th edition, Sultan Chand & Sons, New Delhi.

11.10 SELF-ASSESSMENT QUESTIONS


Q.1 Which of the following cannot be done by using a chi-square test?

(a) Test of difference between a theoretical and an observed probability distribution.
(b) Test of independence between two categorical variables.
(c) Test of difference in means of more than two groups.
(d) Test of difference in proportions of more than two samples.

Q.2 How many degrees of freedom will be required to calculate the table χ² value for a distribution of sample size n with r rows and c columns?
(a) n - (r × c)
(b) (r - 1) × (c - 1)
(c) r × c
(d) (r - 1) + (c - 1)

Q.3 Which of the following is not required to test a hypothesis using the chi-square test?
(a) Chi-square statistic
(b) Degrees of freedom
(c) Level of significance
(d) Category median

Q.4 The minimum value of the expected frequency of every cell to apply the chi-square test should be:
(a) 5
(b) 10
(c) 1
(d) No condition of minimum value applies

Questions 5-7 are based on the following information:


A brand manager is concerned that her brand’s share may be unevenly distributed
throughout the country. She surveyed a total of 1200 consumers, 300 in each of the four
geographic locations (north, south, east, and west), and asked whether they purchase the
brand or not. The following results were obtained.

Table 11.17

Geographical Region
Purchase Status | North | South | East | West | Total
Purchase the Brand | 130 | 135 | 155 | 140 | 560
Do not Purchase the Brand | 170 | 165 | 145 | 160 | 640
Total | 300 | 300 | 300 | 300 | 1200

Q.5 State the null and the alternative hypotheses.

Q.6 Calculate expected frequencies.


Q.7 At α = 0.10, test whether the brand share is equal across the four geographical locations.
Questions 8-10 are based on the following information:
The frequency distribution, given below, of the number of customer arrivals per hour in a bank is assumed to follow a Poisson distribution with λ = 3.

Table 11.18

No. of arrivals per hour | 0 | 1 | 2 | 3 | 4 | 5 or more
Number of hours | 20 | 57 | 98 | 85 | 78 | 62

Q.8 State the null and alternative hypotheses.

Q.9 Calculate expected frequencies.

Q.10 Determine the χ² value and test the hypothesis at α = 0.10.

11.11 CHECK YOUR PROGRESS — POSSIBLE ANSWERS

CHECK YOUR PROGRESS - I

Q.1 Yes

Q.2 No

Q.3 False

CHECK YOUR PROGRESS - II

Q.1
Row | X | Y | Total
I | 12 | 48 | 60
II | 28 | 112 | 140
Total | 40 | 160 | 200

CHECK YOUR PROGRESS - III

Q.1 6

Q.2 No, because a chi-square test for the difference between proportions can be applied only if the variable of interest has just two levels (one of interest and the other not of interest). In this case, we have four primary reasons for shopping.
Q.3 Yes
Q.4

Category | fo | fe | (fo - fe)²/fe
Failed | 220 | 200 | 2
Third Class | 170 | 150 | 2.666667
Second Class | 90 | 100 | 1
First Class | 20 | 50 | 18
Total | 500 | 500 | Σ (fo - fe)²/fe = 23.67

χ² = 23.67
The table value of χ² with 3 degrees of freedom at α = 0.05 is 7.815.

Since the calculated χ² value (23.67) is greater than the table χ² value (7.815), the observed frequencies are significantly different from the expected frequencies. Therefore, we reject the null hypothesis that the results are in the ratio 4:3:2:1. We conclude that the results differ from the general ratio of 4:3:2:1 across the various categories.

11.12 ANSWERS TO SELF-ASSESSMENT QUESTIONS

Q.1 (c)
Q.2 (b)
Q.3 (d)
Q.4 (a)
Q.5 Hypothesis

If $\pi_1$, $\pi_2$, $\pi_3$, and $\pi_4$ represent the consumer proportions of the four geographical regions, the corresponding hypotheses are:

Null Hypothesis H₀: $\pi_1 = \pi_2 = \pi_3 = \pi_4$ states that consumer proportions across the four geographical regions are equal, and
Alternative Hypothesis H₁: states that not all consumer proportions are equal.

Q.6 Expected Frequencies

Geographical Region
Purchase Status | North | South | East | West | Total
Purchase the Brand | 140 | 140 | 140 | 140 | 560
Do not Purchase the Brand | 160 | 160 | 160 | 160 | 640
Total | 300 | 300 | 300 | 300 | 1200

Q.7 Chi-Square Test

fo | fe | (fo - fe)²/fe
130 | 140 | 0.71
170 | 160 | 0.63
135 | 140 | 0.18
165 | 160 | 0.16
155 | 140 | 1.61
145 | 160 | 1.41
140 | 140 | 0.00
160 | 160 | 0.00
Σ (fo - fe)²/fe = 4.69

χ² = 4.69
At α = 0.10 with 3 degrees of freedom, the table value of χ² = 6.251.

Since the calculated χ² value (4.69) is less than the table χ² value (6.251), we fail to reject the null hypothesis. There is no significant evidence that the brand share differs across the four geographical locations.

Q.8 Hypothesis Formulation

Null Hypothesis H₀: The number of customer arrivals per hour follows a Poisson distribution;

Alternative Hypothesis H₁: The number of customer arrivals per hour doesn't follow a Poisson distribution.
Q.9 Expected Frequencies

x | Observed Freq. (fo) | P(X) | Expected Freq. (fe) = P(X) × N
0 | 20 | 0.05 | 19.92
1 | 57 | 0.15 | 59.76
2 | 98 | 0.22 | 89.64
3 | 85 | 0.22 | 89.64
4 | 78 | 0.17 | 67.23
5 or More | 62 | 0.18 | 73.79
Total | N = 400 | 1 | 400
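
The P(X) and fe columns can be reproduced, up to rounding, with the sketch below. It assumes the standard Poisson probability formula with λ = 3 and treats the open-ended class "5 or more" as the remainder after the first five probabilities; the variable names are illustrative.

```python
from math import exp, factorial

lam = 3                                   # Poisson mean assumed in the question
observed = [20, 57, 98, 85, 78, 62]       # hours with 0, 1, 2, 3, 4, and 5 or more arrivals
n_hours = sum(observed)                   # N = 400

# P(X = x) for x = 0..4, then the open-ended class "5 or more" as the remainder.
probabilities = [exp(-lam) * lam ** x / factorial(x) for x in range(5)]
probabilities.append(1 - sum(probabilities))

expected = [n_hours * prob for prob in probabilities]
labels = ["0", "1", "2", "3", "4", "5 or more"]
for label, prob, fe in zip(labels, probabilities, expected):
    print(f"{label:>9}: P = {prob:.4f}, fe = {fe:.2f}")
```

Values computed this way may differ from the table in the second decimal place, because the table appears to use rounded probabilities.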

Q.10 Chi-Square Test

Observed Freq. (fo) | Expected Freq. (fe) | (fo - fe)²/fe
20 | 19.92 | 0.00
57 | 59.76 | 0.13
98 | 89.64 | 0.78
85 | 89.64 | 0.24
78 | 67.23 | 1.73
62 | 73.79 | 1.88
400 | 400 | Σ (fo - fe)²/fe = 4.76

χ² = 4.76
At α = 0.10 with 5 degrees of freedom, the table value of χ² = 9.236.

Since the calculated χ² value (4.76) is less than the table χ² value (9.236), we fail to reject the null hypothesis at the 10% level of significance: the data are consistent with the number of customer arrivals per hour following a Poisson distribution.

UNIT 12
SIMPLE LINEAR REGRESSION

STRUCTURE
12.0 Objectives
12.1 Introduction
12.2 Application of Regression in Business and Management
12.3 Types of Relationship
12.4 Measuring Simple Relationships
12.5 Simple Linear Regression
12.6 Let Us Sum Up
12.7 Keywords
12.8 References and Suggested Additional Readings
12.9 Self-Assessment Questions
12.10 Check Your Progress - Possible Answers
12.11 Answers to Self-Assessment Questions

12.0 OBJECTIVES
After reading this unit, you will be able to understand:

• the concept of the relationship between variables
• types of relationships
• the meaning of correlation and various methods of measuring the correlation between variables
• the meaning of regression
• the simple linear regression model
• types of variables in a simple regression model
• construction of the line of best fit for predicting the value of a dependent variable
• the meaning of the slope and intercept of a regression line
• estimating the value of a dependent variable from the value of an independent variable
• measures of variation in regression analysis

12.1 INTRODUCTION

The word relation or relationship is commonly used to describe events and situations that we experience in our day-to-day lives. Mothers inculcate the habit of cleanliness in their children to keep them healthy. You have grown up hearing that hard work leads to success. A teacher at a management institute helps improve the communication skills of her students for a successful career in management. Astrologers relate the events in an individual's life to the position of the stars. There are countless situations like these where we believe that one event leads to another because there is a relationship between them. When we have quantifiable information about events, we can check whether or not there is a relationship between them. The technique used to study the relationship between events (variables) is called correlation.

Let's understand this by taking the example of the demand for ice cream in relation to the maximum temperature during the day.

Example 12.1: If you visit India Gate, the historic monument that commemorates soldiers who lost their lives in World War I, one thing that cannot escape your attention is the ice-cream pushcarts. India Gate is the country's single largest marketplace for selling ice cream and accounts for almost 11% of overall ice-cream sales in the capital. It is also a favourite ground for ice-cream manufacturers to test their products and offerings. Estimated sales of ice cream at India Gate on peak summer weekends are Rs. 18-20 lakhs per day, suggesting a relationship between the demand for ice cream and the temperature during the day.
As the monsoon approaches, an ice-cream manufacturer wants to predict demand for the coming weekends by using data on the quantity of ice cream sold and the maximum temperature on weekends. The manufacturer records the following data on quantity sold and maximum temperature for the weekend days of June.
Table 12.1

Day | June 6 | June 7 | June 13 | June 14 | June 20 | June 21 | June 27 | June 28
Max. Temp. (°C) | 38 | 36 | 42 | 41 | 40 | 38 | 41 | 43
Sales ('00 kg) | 50 | 44 | 49 | 48 | 52 | 47 | 50 | 55

The maximum temperature of the day (measured in degrees Celsius) and sales (measured in hundreds of kilograms) are quantifiable variables. If there is some evidence of a relationship between these variables, that relationship can be studied by using the principles of correlation.
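
As a first look at such a relationship, the sketch below computes one common measure of linear association, the Pearson correlation coefficient, for the data of Table 12.1. This is an illustration under assumptions: the coefficient has not yet been introduced at this point in the unit, and the June 14 sales figure is unclear in the source, so 48 is assumed here.

```python
from math import sqrt

# Data from Table 12.1. The June 14 sales figure is unclear in the source;
# 48 is assumed here purely for illustration.
temperature = [38, 36, 42, 41, 40, 38, 41, 43]   # maximum temperature (deg C)
sales = [50, 44, 49, 48, 52, 47, 50, 55]         # sales ('00 kg)

n = len(temperature)
mean_x = sum(temperature) / n
mean_y = sum(sales) / n

# Sums of squared deviations and cross-products about the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temperature, sales))
s_xx = sum((x - mean_x) ** 2 for x in temperature)
s_yy = sum((y - mean_y) ** 2 for y in sales)

r = s_xy / sqrt(s_xx * s_yy)   # Pearson correlation coefficient, between -1 and +1
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f} ({direction})")
```

The sign of r gives the direction of the relationship and its magnitude gives the strength, which are the two aspects of a relationship that correlation is meant to describe.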

Variables do not always relate to each other in the same manner; relationships differ both in direction and in strength. Correlation helps in determining the direction and strength of the relationship between variables. Ingrained in the idea of a relationship is the idea of dependence among variables. The amount of weight lost by participating in a weight-loss program depends upon the number of calories consumed. The quantity of food consumed depends upon its taste. In Example 12.1, the quantity of ice cream sold depends upon the temperature during the day. There are innumerable relationships in which some variables are dependent and others are independent. When it is possible to identify the dependent and independent variables in a relationship, it is also possible to express the relationship between them mathematically and to predict the value of the dependent variable(s) if the independent variables are known. The technique used to represent the relationship between dependent and independent variables, and to predict the amount of change in the dependent variable for a change in the value of the independent variables, is called regression. While correlation is concerned with measuring the relationship between variables, regression is used to predict the variation in the value of the dependent variable for a variation in the value of the independent variable. Ice-cream manufacturers can use regression analysis to predict the demand (sales in kg) for ice cream based on information about the maximum temperature during the day.
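
A minimal sketch of such a prediction is given below. It assumes the standard least-squares formulas for the slope and intercept of a straight line, which this introduction has not yet derived, and again assumes 48 for the unclear June 14 sales figure in Table 12.1. The function name `predict_sales` is hypothetical.

```python
# Data from Table 12.1; the June 14 sales value (48) is an assumption.
temperature = [38, 36, 42, 41, 40, 38, 41, 43]   # independent variable (deg C)
sales = [50, 44, 49, 48, 52, 47, 50, 55]         # dependent variable ('00 kg)

n = len(temperature)
mean_x = sum(temperature) / n
mean_y = sum(sales) / n

# Least-squares slope and intercept of the line: sales = intercept + slope * temperature
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(temperature, sales))
         / sum((x - mean_x) ** 2 for x in temperature))
intercept = mean_y - slope * mean_x

def predict_sales(max_temp):
    """Predicted sales ('00 kg) for a given maximum temperature (deg C)."""
    return intercept + slope * max_temp

print(f"sales = {intercept:.2f} + {slope:.2f} * temperature")
print(f"predicted sales at 39 deg C: {predict_sales(39):.1f} ('00 kg)")
```

The slope is the estimated change in sales for a one-degree change in maximum temperature. Predictions are most trustworthy within the range of temperatures actually observed (36 to 43 °C here).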

12.2 APPLICATION OF REGRESSION IN BUSINESS AND MANAGEMENT

Irrespective of the field of study, regression is used wherever there is a relationship between variables and it is possible to identify which of them are dependent (predicted) and which are independent (predictors). Hence, its applicability is universal. Its application in the fields of economics, business, and management can be seen in the following examples:

• Relationship between advertising expense and sales of a product.
• Relationship between the rate of inflation and prices.
• Relationship between compensation paid and employee motivation.
• Relationship between discounts offered and units sold by a retail outlet.
• Relationship between footfall size and rental rate of shops in a shopping mall.
• Relationship between the speed of the conveyor belt and the number of defective items produced on a shop floor.
• Relationship between brand equity and equity price of companies.
• Relationship between stock prices and GDP.
• Relationship between the cost of shipping and shipping technology used by a logistics company.

12.3 TYPES OF RELATIONSHIP

12.3.1 SIMPLE OR MULTIPLE

In a simple relationship, only two variables (one dependent and one independent) are involved. The relationship between two variables is measured by using simple correlation, and the value of the response variable (dependent variable) is predicted by using simple regression. The base price (response variable) of a player in an IPL auction depends on his average ratings (explanatory variable) during the year and can be predicted by using a simple regression equation (regression model).
The relationship among more than two variables (one dependent and more than one independent) is measured by using multiple correlation, and the value of the response variable (dependent variable) is predicted by using multiple regression. The number of units sold of a product (response variable) depends upon advertising expense, personal selling expense, and sales promotion expense (all explanatory variables). The number of units sold, with more than one explanatory variable, can be predicted by conducting multiple regression analysis.

12.3.2 POSITIVE OR NEGATIVE

When the direction of change in the values of variables involved in the relationship is the
same (i.e. when dependent and independent variables either increase or decrease
simultaneously) the relationship is positive (or direct). When the direction of change in the
values of variables is the opposite (i.e. when one variable increases and the other decreases
or vice versa) the relationship is negative (or indirect). Brand equity and product quality have
a positive correlation, whereas the price and demand are negatively correlated.

12.3.3 LINEAR OR NON-LINEAR


When the values of the dependent and independent variables, plotted on a two-dimensional plane, fall along a straight line, the correlation between them is said to be linear. If the plotted values form a curve rather than a straight line, the correlation between them is non-linear (or curvilinear). The number of profiled customers of a bank and the amount of loans disbursed generally depict a linear correlation. The age of a car and its maintenance expenses depict a non-linear correlation, as the maintenance cost is low when the car is new and increases more rapidly as the car gets older.

CHECK YOUR PROGRESS -I


An insurance company wants to predict the premium amount paid by customers based on their annual income by collecting data from a sample of customers.

Q.1 __________ can be used to achieve the above-mentioned objective.

Q.2 The relationship between the premium paid and annual income will be:

(a) Simple non-linear
(b) Simple linear
(c) Multiple non-linear

Q.3 Correlation analysis will not be enough for checking the direction and strength of the relationship between the premium paid and annual income. (True / False)
12.4 MEASURING SIMPLE RELATIONSHIPS

A simple relationship (the relationship between two variables) can be measured by using the following methods:

12.4.1 SCATTER PLOTS

A scatter plot (or scatter diagram) is used to examine the relationship between an X variable (independent or predictor) on the horizontal axis and a Y variable (dependent or response) on the vertical axis. The nature of the relationship can take different forms. The following section provides a snapshot of the various types of relationships that scatter plots can reveal.
Fig. 12.1: Scatter plots of Y against X illustrating different types of relationships (Panels A to H, including Panel E: Strong relationships; Panel G: No relationship; Panel H: No relationship)

The simplest relationship is a straight-line or linear relationship (Panels A, B, E, and F in
Fig. 12.1). These panels suggest a linear correlation between X and Y. However, the correlation
(or relationship) in Panel A is positive (or positive linear) as X and Y increase
simultaneously, whereas in Panel B the correlation is negative (or negative linear)
because Y decreases as X increases.

Both Panels C and D suggest a non-linear correlation (curvilinear relationship) between
variables. In Panel C, Y initially increases as X increases, but beyond a certain point Y
decreases as X increases, thereby suggesting a relationship in the form of a curve. In Panel D,
the increments in the values of X and Y are initially constant; later on, the change in the
value of Y is more rapid than the change in the value of X; and finally, only Y changes and X
does not change at all.

Panel E shows that the correlation (relationship) between X and Y is strong, as the points are
closely scattered around a straight line. However, one relationship (in Panel E) is positive
and the other is negative.

Panel F suggests weak relationships because the points are away from the straight line.
Panels G and H show no correlation (no linear relationship) between X and Y. In Panel G, the
scatter of points is not around a line or a curve. In Panel H, only X changes and Y is constant,
suggesting that Y does not change with X.
The smaller the distance of the points from a line (or a curve) in a scatter plot, the stronger
the relationship; the greater the distance of the points from a line (or a curve), the weaker
the relationship.
Table 12.1: Illustration Scatter Plot for data

[Scatter plot between Temperature and Sales of Ice Cream. Horizontal axis: Maximum Temperature
(Celsius), 34 to 44; vertical axis scale: 0 to 60.]


CHECK YOUR PROGRESS - II


The following data about the premium amount paid and annual income were obtained from a
sample of customers to predict the premium amount based on the annual income of
customers:

Table 12.2

Customer Id                 | C1 | C2  | C3 | C4 | C5 | C6 | C7 | C8 | C9  | C10

Annual Income (Rs. Lakh)    | 8  | 6.5 | 7  | 5  | 6  | 13 | 18 | 10 | 7.5 | 9

Premium Paid (Rs. Thousand) | 18 | 15  | 10 | 6  | 7  | 15 | 20 | 12 | 16  | 11

Q.1 Plot a scatter diagram between the two variables and state:
(a) Whether the relationship between them is direct (positive) or indirect
(negative).

(b) Whether the relationship between them is strong, moderate, or weak.

12.4.2 KARL PEARSON’S COEFFICIENT OF CORRELATION

A simple linear relationship between variables can be quantitatively measured by calculating
Karl Pearson’s Coefficient of Correlation r, the value of which ranges between -1 and 1. A
negative value of r suggests that the correlation (relationship) is negative and a positive
value of r suggests that the correlation is positive. Also, the closer the value of r to -1 or 1,
the stronger the correlation, and the closer the value of r to 0, the weaker the correlation
between variables. An r value of either 1 or -1 means that all points lie on a straight line
on a scatter plot, a distant possibility for small samples in real life.
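Calculation of Karl Pearson’s Coefficient of Correlation r

The value of r can be calculated as

$$r = \frac{\text{Covariance between X and Y}}{(\text{Standard deviation of X}) \times (\text{Standard deviation of Y})} = \frac{S_{XY}}{S_X \, S_Y}$$

$$r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \, \sum (Y_i - \bar{Y})^2}} \qquad \text{(Equation 12.1)}$$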


Example 12.1:
Illustration: Calculation of Karl Pearson’s Coefficient of Correlation (r) for data.

X (Temperature) | Y (Sales) | (Xᵢ − X̄) | (Yᵢ − Ȳ) | (Xᵢ − X̄)(Yᵢ − Ȳ) | (Xᵢ − X̄)² | (Yᵢ − Ȳ)²
38              | 50        | -1.9     | 0.6      | -1.17             | 3.52      | 0.39
36              | 44        | -3.9     | -5.4     | 20.83             | 15.02     | 28.89
42              | 49        | 2.1      | -0.4     | -0.80             | 4.52      | 0.14
41              | 48        | 1.1      | -1.4     | -1.55             | 1.27      | 1.89
40              | 52        | 0.1      | 2.6      | 0.33              | 0.02      | 6.89
38              | 47        | -1.9     | -2.4     | 4.45              | 3.52      | 5.64
41              | 50        | 1.1      | 0.6      | 0.70              | 1.27      | 0.39
43              | 55        | 3.1      | 5.6      | 17.58             | 9.77      | 31.64
X̄ = 39.9        | Ȳ = 49.4  |          |          | Σ = 40.38         | Σ = 38.88 | Σ = 75.88

$$r = \frac{40.38}{\sqrt{(38.88)(75.88)}} = 0.743409$$

r = 0.743409 indicates a strong correlation between ice-cream sales and temperature. The
positive value of r indicates that as temperature increases, sales of ice-cream also increase.
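
The calculation in Example 12.1 can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library; the function name `pearson_r` and the variable names are assumptions made for this example and do not come from the text.

```python
import math

# Temperature (X) and sales (Y) values from Example 12.1
temperature = [38, 36, 42, 41, 40, 38, 41, 43]
sales = [50, 44, 49, 48, 52, 47, 50, 55]


def pearson_r(x, y):
    """Karl Pearson's r from deviations about the means (Equation 12.1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    sum_yy = sum((yi - mean_y) ** 2 for yi in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


r = pearson_r(temperature, sales)
print(round(r, 6))  # 0.743409
```

Working with unrounded means (39.875 and 49.375) gives the same value as the table, which shows deviations rounded to one decimal place.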

The value of r can also be calculated by using the following equation:

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}} \qquad \text{(Equation 12.2)}$$

Where n = number of pairs of values of X and Y

Note: The value of r will not change by reversing the names X and Y used to denote variables.
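
As a check on Equation 12.2, the sketch below computes r from raw sums for the Example 12.1 data and confirms that swapping X and Y leaves the value unchanged. The function name `pearson_r_from_sums` is hypothetical and used only for this illustration.

```python
import math

# Temperature (X) and sales (Y) values from Example 12.1
temperature = [38, 36, 42, 41, 40, 38, 41, 43]
sales = [50, 44, 49, 48, 52, 47, 50, 55]


def pearson_r_from_sums(x, y):
    """Karl Pearson's r from raw sums (Equation 12.2)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r_xy = pearson_r_from_sums(temperature, sales)
r_yx = pearson_r_from_sums(sales, temperature)  # names of X and Y reversed
print(round(r_xy, 6), round(r_yx, 6))
```

Both calls return the same value as the deviation form of Equation 12.1, since the two equations are algebraically equivalent.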

12.4.3 SPEARMAN’S RANK CORRELATION COEFFICIENT (R)


If the values of variables X and Y are given as ranks (or can be converted to ranks), a simple
linear relationship between variables can be calculated using Spearman’s Rank Correlation
Coefficient R. The value of R is interpreted in the same manner as r (Karl
Pearson’s Coefficient).
The value of R is calculated by using the following equation:

$$R = 1 - \frac{6\sum D^2}{n^3 - n} \qquad \text{(Equation 12.3)}$$

where D = pairwise difference between the ranks of X and Y, and n = number of pairs of ranks.

Illustration: Calculation of Spearman’s Rank Correlation Coefficient R

Example 12.2:
The average maximum temperature (city-wise ranking) for June weekends for ten cities of
India and the average sales of ice cream (city-wise ranking) are recorded in Table 12.3 below:
Table 12.3

City           | Average Temperature (Ranking) | Average Sales (Ranking)
Agra           | 1                             | 2
Amritsar       | 2                             | 1
Bhopal         | 3                             | 4
Mysore         | 9                             | 8
Nagpur         | 4                             | 3
Patna          | 5                             | 6
Ranchi         | 7                             | 7
Srinagar       | 8                             | 10
Vijayawada     | 10                            | 9
Vishakhapatnam | 6                             | 5

Solution 12.2: Difference of Ranks (D) and other calculations are shown below:

City           | Temp. Rank (R₁) | Sales Rank (R₂) | D = (R₁ − R₂) | D²
Agra           | 1               | 2               | -1            | 1
Amritsar       | 2               | 1               | 1             | 1
Bhopal         | 3               | 4               | -1            | 1
Mysore         | 9               | 8               | 1             | 1
Nagpur         | 4               | 3               | 1             | 1
Patna          | 5               | 6               | -1            | 1
Ranchi         | 7               | 7               | 0             | 0
Srinagar       | 8               | 10              | -2            | 4
Vijayawada     | 10              | 9               | 1             | 1
Vishakhapatnam | 6               | 5               | 1             | 1
n = 10         |                 |                 |               | ΣD² = 12

$$R = 1 - \frac{6 \times 12}{10^3 - 10} = 0.927273$$
R = 0.927273 indicates a high correlation between ice-cream sales and temperature. The
positive value of R indicates that as temperature increases, sales of ice-cream also increase,
and hence the correlation is also positive.
Sometimes, when ranks are not provided, we can obtain the ranks from the given values of
variables X and Y to calculate the value of Spearman’s coefficient R.
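
A minimal Python sketch of Equation 12.3, applied to the ranks in Table 12.3, is given below. The helper `to_ranks`, which converts raw values to ranks, is a hypothetical illustration that assumes no tied values; tied values would need average ranks, which this sketch does not handle.

```python
# Ranks from Table 12.3, in the order the cities are listed
temp_rank = [1, 2, 3, 9, 4, 5, 7, 8, 10, 6]
sales_rank = [2, 1, 4, 8, 3, 6, 7, 10, 9, 5]


def spearman_rank(rank_x, rank_y):
    """Spearman's R = 1 - 6 * sum(D^2) / (n^3 - n) (Equation 12.3)."""
    n = len(rank_x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n ** 3 - n)


def to_ranks(values):
    """Rank values from smallest (1) to largest (n); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


R = spearman_rank(temp_rank, sales_rank)
print(round(R, 6))  # 0.927273

# Converting raw values to ranks first (illustrative values, no ties)
print(to_ranks([8, 6.5, 7]))  # [3, 1, 2]
```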

Note: Values of Karl Pearson’s Coefficient of Correlation (r) and Spearman’s Correlation
Coefficient (R) calculated for the same data may not be the same.

CHECK YOUR PROGRESS - III


Q.1 For the data about premium paid and annual income of customers, calculate and
interpret the value of the following:

(a) Karl Pearson’s Coefficient of Correlation

(b) Spearman’s Rank Coefficient of Correlation

Also, which one of the above two is a better reflection of the relationship between the
variables?

12.5 SIMPLE LINEAR REGRESSION


The most elementary regression model is simple linear regression, a bivariate
(involving two variables) linear regression used to predict one variable from another variable.
Of the two variables, one is dependent and the other is independent. This model
describes the linear component of the relationship that exists between these two variables.
In correlation analysis, since we are only interested in studying the direction and extent of
the relationship between variables, it is not important to decide which of the two variables
is dependent and which is independent. But in regression, since we are interested in testing
the impact of one variable on the other, it becomes important to decide which of the
two variables is dependent and which is independent. The impact of the independent variable
on the dependent variable is measured.

12.5.1 VARIABLES USED IN REGRESSION ANALYSIS


Independent variable
A variable whose values are manipulated or changed by researchers and whose effects on the
dependent variable are measured is called the independent variable. The independent
variable is also called the explanatory variable, controlled variable, manipulated variable,
input variable, predictor, or regressor. Independent variables predict or forecast the values
of the dependent variable in the regression model.
Dependent variable

A variable whose value is predicted is called a dependent variable because its value changes
in response to changes in the value of the independent variable(s). The dependent variable is
also called the explained variable, response variable, output variable, predicted variable, or
regressand.

The dependent variable’s values are predicted in the regression model.


Both variables (explanatory and response) should typically be metric for simple regression
analysis. For non-metric data, the regression model takes various other forms, which are out of
the scope of the present unit.

CHECK YOUR PROGRESS - IV


Q.1 For predicting the premium amount paid on the basis of annual income, __________ is the
predictor variable and __________ is the response variable.

Q.2 It is meaningless to decide which of the above variables is dependent and which is
independent to predict the premium amount. (True/False).

12.5.2 REGRESSION ANALYSIS

Regression Analysis is the process of constructing a mathematical model or function that can
be used to predict or determine one variable from another variable. It involves the construction
of a linear equation, called a regression equation, that describes the relationship between
the dependent and independent variables.
The general regression model for population data is given below:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

The terms used in the model are shown below:

Yᵢ: dependent variable
β₀: population Y intercept
β₁: population slope coefficient
Xᵢ: independent variable
εᵢ: random error term
β₀ + β₁Xᵢ: linear component
εᵢ: random error component

In the above model, the Yᵢ = β₀ + β₁Xᵢ portion is a straight line. The slope of the line, β₁,
represents the expected change in Y per unit change in X. It represents the mean amount that Y
changes (either positively or negatively) for a one-unit change in X. The Y-intercept, β₀,
represents the mean value of Y when X equals 0.

The last component of the model, εᵢ, represents the random error in Y for each observation
i. In other words, εᵢ is the vertical distance of the actual value of Yᵢ above or below the
expected value of Y on the line.
Fig. 12.2, below, provides the model summary for population values of X (independent
variable) and Y (dependent variable).

Fig. 12.2: The regression line Yᵢ = β₀ + β₁Xᵢ + εᵢ plotted with X on the horizontal axis and Y on
the vertical axis, showing the observed value of Y for Xᵢ and the predicted value of Y for Xᵢ
12.5.2.1 SIMPLE LINEAR REGRESSION EQUATION
In the preceding section, we discussed the regression model that describes the relationship
between variables for the population under consideration. However, practically, the data are
collected from a sample. If certain assumptions are valid, we can use the sample Y intercept,
b₀, and the sample slope, b₁, as estimates of the respective population parameters, β₀ and β₁.
The equation below uses these estimates to form a simple linear regression equation.

$$\hat{Y}_i = b_0 + b_1 X_i \qquad \text{(Equation 12.4)}$$

The straight line constructed using sample data (eq. 12.4) is often referred to as the
prediction line. The following terms are used in the above model constructed from sample
values of X and Y.

Ŷᵢ: estimated (predicted) value of Y for observation i
Xᵢ: value of X for observation i
b₀: sample Y-intercept
b₁: sample slope

We are required to determine two regression coefficients, b₀ (the sample Y-intercept) and b₁
(the sample slope). The most common approach to finding b₀ and b₁ is using the method of
least squares.

12.5.2.2 THE LEAST-SQUARE METHOD


We can observe in Fig. 12.3 that the actual data values don’t lie on the straight line, but above
or below the line. Because we use the straight line (prediction line) to predict the values of Y,
minimizing the error requires minimizing the vertical distance between the actual (Yᵢ) and
predicted (Ŷᵢ) values. The least-square method minimizes the sum of the squared differences
between the actual values (Yᵢ) and the predicted values (Ŷᵢ) obtained from the simple
regression equation. That is, the sum of squared differences Σ(Yᵢ − Ŷᵢ)² should be minimum.

Since Ŷᵢ = b₀ + b₁Xᵢ, therefore,

$$\min \sum (Y_i - \hat{Y}_i)^2 = \min \sum \left(Y_i - (b_0 + b_1 X_i)\right)^2$$

Because this equation has two unknowns, b₀ and b₁, the sum of squared differences depends
on these two unknowns. The least-square method mathematically determines the values of
b₀ and b₁ that minimize the sum of squared differences around the prediction line. Any
values of b₀ and b₁ other than those determined by the least-squares method result in a
greater sum of squared differences between the actual (Yᵢ) and predicted (Ŷᵢ) values. The
process of regression model construction involves calculating the values of the regression
coefficients b₀ and b₁. Substituting the values of b₀ and b₁ in the above equation provides the
resultant regression model that is used to analyze the relationship and predict the values of
the dependent variable.

12.5.2.3 REGRESSION COEFFICIENTS b₀ AND b₁

The least-squares method provides the following equations to calculate the values of b₁
(slope of the line) and b₀ (intercept):

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \qquad \text{(Equation 12.5)}$$

or

$$b_1 = \frac{S_{XY}}{S_X^2} \qquad \text{(Equation 12.6)}$$

$$b_0 = \bar{Y} - b_1 \bar{X} \qquad \text{(Equation 12.7)}$$


Illustration: Linear regression model construction for Example no. 12.1

X        | Y        | (Xᵢ − X̄) | (Yᵢ − Ȳ) | (Xᵢ − X̄)(Yᵢ − Ȳ) | (Xᵢ − X̄)²
38       | 50       | -1.9     | 0.6      | -1.17             | 3.52
36       | 44       | -3.9     | -5.4     | 20.83             | 15.02
42       | 49       | 2.1      | -0.4     | -0.80             | 4.52
41       | 48       | 1.1      | -1.4     | -1.55             | 1.27
40       | 52       | 0.1      | 2.6      | 0.33              | 0.02
38       | 47       | -1.9     | -2.4     | 4.45              | 3.52
41       | 50       | 1.1      | 0.6      | 0.70              | 1.27
43       | 55       | 3.1      | 5.6      | 17.58             | 9.77
X̄ = 39.9 | Ȳ = 49.4 |          |          | Σ = 40.38         | Σ = 38.88
ΣX = 319 | ΣY = 395 |          |          |                   |

$$b_1 = \frac{40.38}{38.88} = 1.038585$$

Using the unrounded means X̄ = 319/8 = 39.875 and Ȳ = 395/8 = 49.375,

$$b_0 = 49.375 - 1.038585 \times 39.875 = 7.961415$$


The regression model (prediction line) for predicting the sales (kg) on temperature (Celsius)
is

$$\hat{Y}_i = 7.961415 + 1.038585\, X_i$$
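
The least-squares coefficients above can be reproduced with a short Python sketch based on Equations 12.5 and 12.7. The function name `fit_line` is an assumption made for this illustration; the data are those of Example 12.1.

```python
# Temperature (X) and sales (Y) values from Example 12.1
temperature = [38, 36, 42, 41, 40, 38, 41, 43]
sales = [50, 44, 49, 48, 52, 47, 50, 55]


def fit_line(x, y):
    """Return (b0, b1) of the least-squares prediction line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum_xy / sum_xx          # slope, Equation 12.5
    b0 = mean_y - b1 * mean_x     # intercept, Equation 12.7
    return b0, b1


b0, b1 = fit_line(temperature, sales)
print(round(b0, 6), round(b1, 6))  # 7.961415 1.038585
```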
Interpreting the values of Slope (b₁) and Intercept (b₀)

In the above regression model (prediction line), the slope b₁ means that sales (the predicted
value of the dependent variable, Y) are estimated to increase by 103.8585 kg (or 1.038585 x 100
kg) for each one-degree increase in temperature (independent variable, X). Equally, for each
one-degree decrease in temperature, sales of the manufacturer are estimated to decrease by
103.8585 kg. A negative value of the slope b₁ would suggest that as X increases, Y decreases,
and vice versa. Thus, the slope represents the portion of sales that is estimated to vary
with temperature.
The Y-intercept, b₀, represents the predicted value of Y when X = 0. But the value of b₀ should
be interpreted cautiously according to the type of variable under consideration. In our
example, temperature (X) is interval data, where X = 0 does not mean an absence of
temperature altogether. Since X = 0, in this case, does not mean no temperature, the value
b₀ = 7.961415 kg (when X = 0) may not be meaningful.

Example 12.3: A consumer research firm wants to predict the weekly expense of consumers
based on their weekly income. Based on the data collected from consumers, the regression
model Ŷᵢ = 250 + 200 Xᵢ is obtained, where X is weekly income in Rs thousand and Y is weekly
expense in Rs. The slope value (b₁ = 200) means that for every additional Rs 1,000
earned, spending increases by Rs 200. A consumer who earns Rs 10,000 per week is
estimated to spend Rs 2,250 (250 + 200 x 10). The intercept value (b₀ = 250) indicates that when
the consumer does not earn (X = 0), the weekly expense is predicted to be Rs 250. In this
example the value of b₀, unlike in the previous example, is meaningful.
Slope: Rate of change in the value of the dependent variable with respect to the independent
variable, or the change observed in the value of the dependent variable when the independent
variable changes by one unit.
Intercept: Value of the dependent variable when the value of the independent variable is
zero.
Let’s use the prediction line Ŷᵢ = 7.961415 + 1.038585 Xᵢ to predict the demand for ice-cream
for a particular Sunday having a forecasted maximum temperature of 37 degrees Celsius.

The value of Ŷ for X = 37 will be 46.38906. Therefore, the demand for ice-cream on a day with
a maximum temperature of 37 degrees Celsius will be 4638.906 kg.

CHECK YOUR PROGRESS - V


Q.1 Construct a regression model for the data on the premium amount and annual income
(given in Table 12.2) of customers.
Q.2 Interpret the value of the slope (b₁) and intercept (b₀).

12.5.2.4 INTERPOLATION: MAKING PREDICTION WITH REGRESSION ANALYSIS


When using the regression model for prediction, only the relevant range of the independent
variable is considered. This relevant range includes all values from the smallest to the largest
(of X) used in constructing the regression model. Hence, when predicting Y for a given value
of X, we should not go beyond the range of X. This is known as interpolation. In our example,
we can predict the value of Y for every value of X from 36 to 43. Predicting the value of Y for
an X value less than 36 or more than 43 will be a case of extrapolation, something that
regression cannot handle reliably.
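
One way to enforce the relevant range in practice is sketched below. The coefficients and the range 36 to 43 are taken from the ice-cream example; the function name `predict_sales` and the choice to raise an error outside the range are assumptions made for this illustration, not a prescribed procedure.

```python
# Prediction line and relevant range from the ice-cream example
B0 = 7.961415          # intercept
B1 = 1.038585          # slope
X_MIN, X_MAX = 36, 43  # smallest and largest temperature used to fit the model


def predict_sales(temperature):
    """Predict sales for a temperature, refusing to extrapolate."""
    if not X_MIN <= temperature <= X_MAX:
        raise ValueError(
            f"{temperature} is outside the relevant range {X_MIN} to {X_MAX}"
        )
    return B0 + B1 * temperature


print(round(predict_sales(37), 5))  # 46.38906 (interpolation)

try:
    predict_sales(45)               # extrapolation is rejected
except ValueError as error:
    print(error)
```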

CHECK YOUR PROGRESS - VI


Q.1 What are the minimum and maximum values of annual income (data are given in Table 12.2)
between which the regression model would make better predictions?

Q.2 Predicting the values of the dependent variable by extrapolation using regression
analysis provides accurate results. (True/False).

12.5.2.5 RESIDUALS

Let’s calculate the predicted values of Y by using the regression model, Ŷᵢ = 7.961415 +
1.038585 Xᵢ, developed for example 12.1.

X  | Observed Y | Predicted Y | Residual (Observed − Predicted)
38 | 50         | 47.428      | 2.572
36 | 44         | 45.35       | -1.35
42 | 49         | 51.582      | -2.58
41 | 48         | 50.543      | -2.54
40 | 52         | 49.505      | 2.495
38 | 47         | 47.428      | -0.43
41 | 50         | 50.543      | -0.54
43 | 55         | 52.621      | 2.379

The values of X and Y used to construct the model are called observed values, and the values of
Y predicted by using the regression model are called predicted values of Y. There is a
difference between the observed and predicted values of Y. This is known as the residual.
Residuals may be positive or negative depending on whether the observed point is above or
below the line.
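
The residuals can be computed directly from the prediction line, as in the sketch below. It uses the rounded coefficients quoted in the text; the variable names are illustrative assumptions.

```python
# Data from Example 12.1 and the fitted prediction line
temperature = [38, 36, 42, 41, 40, 38, 41, 43]
sales = [50, 44, 49, 48, 52, 47, 50, 55]
B0, B1 = 7.961415, 1.038585

# Predicted value of Y for each observed X
predicted = [B0 + B1 * x for x in temperature]

# Residual = observed Y minus predicted Y
residuals = [y - y_hat for y, y_hat in zip(sales, predicted)]

for x, y, y_hat, e in zip(temperature, sales, predicted, residuals):
    print(x, y, round(y_hat, 3), round(e, 3))

# With a least-squares line that includes an intercept,
# the residuals sum to approximately zero.
print(round(sum(residuals), 3))
```

A positive residual means the observed point lies above the line; a negative residual means it lies below.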

Fig. 12.3: Scatter plot of the observed values with the fitted regression line
y = 1.0386x + 7.9614 (horizontal axis: 36 to 44; vertical axis: 10 to 60)
The regression equation developed for example 12.1 is shown in Fig. 12.3. As the graph
clearly shows, all values are scattered around the regression line. The vertical distances
between the observed values of Y and the predicted values of Y (estimated on the prediction
line) are residuals. As discussed in the previous section, the method of least squares
minimizes the sum of the squares of these distances. It calculates the values of b₀ and b₁
that give the regression line minimizing these squared vertical distances over all the data
points. Even though the data values don’t lie on the line, this line is still the best line of
fit for all points.
Any regression model explains some proportion of the variation in the value of Y (caused by X)
and at the same time fails to explain some proportion of the variation in the value of Y. The
unexplained part will typically come from the points far off from the regression line. However,
if such points are few in number and randomly scattered, the model can still be used. The
portion of the value of Y which remains unexplained by the regression model is the error
term (ε).

CHECK YOUR PROGRESS - VII

Q.1 Calculate the predicted values of premium amount using the observed values of
annual income (given in Table 12.2). Also calculate residuals.

12.5.3 MEASURES OF VARIATION IN REGRESSION


We calculate the following three measures of variation to further interpret the regression
model.

12.5.3.1 TOTAL VARIATION (SST)

Total Sum of Squares (SST) is the variation of Y values around their mean, Ȳ.

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \qquad \text{(Equation 12.8)}$$

SST is divided into two parts, explained variation and unexplained variation, where Y values
are the observed ones.

SST = SSR + SSE

12.5.3.2 REGRESSION VARIATION/EXPLAINED VARIATION (SSR)


Regression Sum of Squares (SSR), also called regression variation, is the variation in the value
of Y explained by the relationship between X and Y. Therefore, it is the variation explained by
the model.

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \qquad \text{(Equation 12.9)}$$

where Ŷᵢ = predicted value.
12.5.3.3 ERROR VARIATION/UNEXPLAINED VARIATION (SSE)

Error Sum of Squares (SSE), also called error variation, is the variation in the value of Y due to
factors other than the relationship between X and Y. It represents that part of the variation in
the value of Y that is not explained by the regression.

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \qquad \text{(Equation 12.10)}$$

12.5.3.4 THE COEFFICIENT OF DETERMINATION

The ratio of SSR to SST measures the proportion of variation in Y that is explained by the
independent variable X. This ratio is called the coefficient of determination, r².

$$r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SSR}{SST} \qquad \text{(Equation 12.11)}$$

The coefficient of determination measures the proportion of variation in Y that is explained by
the variation in the value of the independent variable X in the regression model.

12.5.3.5 STANDARD ERROR OF THE ESTIMATE

The standard error of the estimate measures the variability of the observed values of Y from
the predicted values in the same way as standard deviation shows the variability of original
values from their mean. In other words, the standard error of the estimate, S_YX, is the standard
deviation of the regression model calculated as below:

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n-2}} \qquad \text{(Equation 12.12)}$$

The unit of S_YX is the same as that of the dependent variable.


Illustration: Calculation and Interpretation of Measures of Variation

Let’s calculate and interpret the values of measures of variation, coefficient of determination
and standard error of the estimate for example 12.1.

Xᵢ | Yᵢ | Ŷᵢ      | (Ŷᵢ − Ȳ)² | (Yᵢ − Ŷᵢ)² | (Yᵢ − Ȳ)²
38 | 50 | 47.4277 | 3.79216   | 6.61697    | 0.39063
36 | 44 | 45.3505 | 16.1967   | 1.8238     | 28.8906
42 | 49 | 51.582  | 4.87082   | 6.66669    | 0.14063
41 | 48 | 50.5434 | 1.36518   | 6.46893    | 1.89063
40 | 52 | 49.5048 | 0.01685   | 6.22591    | 6.89063
38 | 47 | 47.4277 | 3.79216   | 0.18289    | 5.64063
41 | 50 | 50.5434 | 1.36518   | 0.29529    | 0.39063
43 | 55 | 52.6206 | 10.5338   | 5.66165    | 31.6406
Ȳ = 49.375    |   | Σ = 41.9329 | Σ = 33.9421 | Σ = 75.875


Regression Variation

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = 41.9329$$

Error Variation

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 33.9421$$

Total Variation

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = SSR + SSE = 75.875$$
The total variation (SST) in the value of sales (dependent variable, Y) is approximately 75.88.
Of this, approximately 41.93 (SSR) is explained by the relationship of sales with temperature
(independent variable, X). The remaining 33.94 (SSE, the error variation) is not explained by
the model; it arises from the differences between the observed values and predicted values of
Y (the vertical distances between the points and the line, as seen in the diagram above) and
may be attributed to factors other than the relationship between temperature (X) and
demand (Y). Note that these three values are sums of squares, expressed in squared units of
Y, not percentages.
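
The three measures of variation can be verified with the Python sketch below, which refits the line from the Example 12.1 data and then applies Equations 12.8 to 12.10. Variable names are illustrative assumptions.

```python
# Data from Example 12.1
temperature = [38, 36, 42, 41, 40, 38, 41, 43]
sales = [50, 44, 49, 48, 52, 47, 50, 55]

n = len(sales)
mean_x = sum(temperature) / n
mean_y = sum(sales) / n

# Least-squares coefficients (Equations 12.5 and 12.7)
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(temperature, sales))
      / sum((x - mean_x) ** 2 for x in temperature))
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * x for x in temperature]

# Total, explained, and unexplained variation
sst = sum((y - mean_y) ** 2 for y in sales)                       # Equation 12.8
ssr = sum((y_hat - mean_y) ** 2 for y_hat in predicted)           # Equation 12.9
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(sales, predicted)) # Equation 12.10

print(round(sst, 4), round(ssr, 4), round(sse, 4))
```

The printed values should agree with the column totals of the table (75.875, 41.9329, and 33.9421), and SSR + SSE should equal SST up to floating-point rounding.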

These measures of variation by themselves provide little information. Their ratios are more
meaningful and provide insights on the accuracy of the regression.

CHECK YOUR PROGRESS - VIII


Q.1 Calculate various measures of variation from the regression model of premium
amount on annual income.
The Coefficient of Determination for the regression model of example 12.1

$$r^2 = \frac{\text{Explained Variation}}{\text{Total Variation}} = \frac{SSR}{SST} = \frac{41.9329}{75.875} = 0.552657 \text{ or } 55.27\%$$

Approximately 55% of the variation in sales is due to variation in temperature as explained by
the model. It also indicates that the remaining 45% (1 − r²) of the variation in sales may be
attributed to some other factors not considered in the study.
Standard Error of the Estimate for regression model of example 12.1

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{33.9421}{8-2}} = 2.378449$$

The standard deviation of the regression model is 2.378449 kg.
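
Both ratios follow directly from the sums of squares. The sketch below starts from the rounded SSR and SSE values reported above, so the last decimal place can differ slightly from the figures in the text; the variable names are assumptions made for this illustration.

```python
import math

# Sums of squares reported for Example 12.1 (rounded to four decimals)
SSR = 41.9329
SSE = 33.9421
SST = SSR + SSE  # total variation
n = 8            # number of observations

# Coefficient of determination (Equation 12.11)
r_squared = SSR / SST

# Standard error of the estimate (Equation 12.12)
s_yx = math.sqrt(SSE / (n - 2))

print(round(r_squared, 4))  # 0.5527
print(round(s_yx, 4))       # 2.3784
```

For simple linear regression, r² is also the square of Karl Pearson’s r: 0.743409 squared is approximately 0.5527.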


CHECK YOUR PROGRESS - IX


Q.1 Calculate and interpret the value of the coefficient of determination from the
regression model of premium amount on annual income.
Q.2 Calculate the model standard deviation and specify its unit. The standard deviation
of the model is known as __________.

12.6 LET US SUM UP

Many variables in real life are related to one another in some way or the other. It becomes
crucial for managers to understand the nature of relationships between variables to make
important business decisions. This relationship between variables can be simple or very
complicated depending upon the number of variables involved and how they move in relation to
one another. A simple linear regression model, which involves only two variables that are
related to each other linearly, is a starting point to understand these relationships.
While methods of studying correlation like the scatter diagram method, Karl Pearson’s
coefficient, and Spearman’s Rank coefficient can be used to measure the direction and
strength of the relationship between two variables, a simple linear regression model is used
to predict the values of one dependent variable (Y) based on one independent variable (X).
The simple linear regression model is based on the method of least squares, which calculates
the values of intercept (b₀) and slope (b₁) that give the prediction line minimizing the sum of
squared vertical distances between the actual values and predicted values (on the line) of the
dependent variable (Y).
The slope (b₁) of the line of prediction indicates the rate of change in the dependent variable
(Y) per unit change in the independent variable (X), and the intercept (b₀) indicates the
predicted value of the dependent variable (Y) when the value of the independent variable (X) is
zero. Based on these values of the regression coefficients, the line of prediction (regression
model) is developed to predict the value of Y from the values of X. Since points don’t strictly
lie on the prediction line, there is some difference between the observed and predicted values
of Y. This difference is known as the residual, which points to the error term (ε) of the
regression model. As the regression model is not generally used to extrapolate, we use
interpolation to predict the value of Y from the model.
To measure the variation caused in the value of the dependent variable (Y) due to variation in
the value of the independent variable (X), measures of variation are calculated. The three
measures of variation are SST (total sum of squares) or total variation, SSR (regression sum of
squares) or regression variation, and SSE (error sum of squares) or error variation. Since
these measures in themselves don’t communicate a lot, their ratios are calculated. The ratio
of SSR to SST, called the coefficient of determination, gives the proportion of variation in the
value of Y explained by the relationship between X and Y. The remaining proportion that remains
unexplained by the relationship between X and Y is the error part. The standard error of the
estimate calculated from SSE, expressed in the same units as that of the dependent variable,
is the standard deviation of the regression model.

12.7 KEYWORDS


Coefficient of Determination: The ratio of regression variation (SSR) to total variation (SST)
used to explain the variation in Y due to the relationship between X and Y.
Correlation: A technique used to measure the direction and strength of the relationship
between two variables.
Dependent Variable: Variable whose value is predicted.
Error Variation: Variation in the value of Y not explained by the regression model.

Independent Variable: Variable used to predict the value of the dependent variable.
Intercept: Value of Y for X = 0 in the regression equation.

Interpolation: Using minimum and maximum observed values of X to form a relevant range
for which Y will be estimated by the regression model.
Karl Pearson’s Coefficient of Correlation: A unit-free term that has a magnitude and sign to
measure the strength and direction of the relationship between two variables.
Method of Least Squares: The method that minimizes the sum of squared deviations
between observed and predicted values of Y for calculating the value of the slope and
intercept of the prediction line.
Multiple Regression: The regression model used to depict the relationship between more
than two variables.

Prediction Line: The line obtained such that the sum of squared vertical distances between
observed values and predicted values of Y is minimum.

Regression Variation: Variation explained by the regression model.


Residual: The difference in the observed and predicted value of Y.

Scattered Plot: Two-dimensional diagram used to show the pattern of relationship between
two variables.
Simple Linear Regression Model: Regression model consisting of a linear relationship
between two variables.
Slope: Rate of change in Y with respect to X.
Spearman’s Rank Correlation Coefficient: A measure used to calculate correlation when the
data of variables are available as rank orders.
Standard Error of the Estimate: The standard deviation of the regression model.

12.8 REFERENCES AND SUGGESTED ADDITIONAL READINGS


Levine, David M., Stephan, David F., and Szabat, Kathryn A., 2016, Statistics for Managers
Using Microsoft Excel, 7th edition, Pearson India Education Services Pvt. Ltd., Noida.

Levin, Richard I., Rubin, David S., Rastogi, Sanjay, and Siddiqui, Masood Husain, 2013,
Statistics for Management, 7th edition, Pearson Education Inc., Noida.

Shrivastava, T.N., and Rego, Shailaja, 2008, Statistics for Management, Tata McGraw-Hill,
New Delhi.

Gupta, S.P., and Gupta, M.P., 2010, Business Statistics, 16th edition, Sultan Chand & Sons, New
Delhi.

https://economictimes.indiatimes.com/industry/cons-products/food/india-gate-indias-single-largest-ice-cream-selling-point/articleshow/5753043.cms?from=mdr

https://timesofindia.indiatimes.com/city/delhi/Delhi-slurped-ice-cream-worth-Rs-20-lakh-this-weekend-at-India-Gate/articleshow/47580940.cms

https://www.accuweather.com/en/in/delhi/202396/june-weather/202396

12.9 SELF-ASSESSMENT QUESTIONS


Multiple Choice Questions
Q.1 Correlation between the profit of mobile phone manufacturing companies and the
weight of their CEOs
(a) cannot be determined

(b) is linear
(c) is nonsensical
(d) is multiple

Q.2 The relationship between the price of a commodity and quantity demanded is:
(a) Linear and positive

(b) Linear and negative

(c) Non-linear and positive

(d) Non-linear and negative

Q.3 Fitting a straight line to a set of data yields the regression model Ŷᵢ = 5.5 + 20.05 Xᵢ.
What will be the estimated value of Y for X = 15?

(a) 15
(b) 103
(c) 25.55
(d) 306.25

Q.4 For the regression model Ŷᵢ = 5.5 + 20.05 Xᵢ, the value of the residual for (X, Y) = (20, 400)
will be
(a) -6.5

(b) 6.5
(c) 406.5
(d) -406.5
Q.5 The slope of a regression line can be defined as:

(a) Change in the value of the dependent variable when the independent variable
doesn’t change.

(b) Value of the dependent variable when the value of the independent variable is
zero.
(c) Change in the value of the dependent variable when the independent variable
changes by one unit.

(d) Change in the value of the dependent variable when the independent variable
changes by the same quantity.

Q.6 The least-squares method minimizes the sum of the squared differences between:
(a) Observed values and mean value of the dependent variable.

(b) Predicted values and mean value of the dependent variable.

(c) Observed values of the independent variable and predicted values of the
dependent variable.

(d) Observed and predicted values of the dependent variable.


Q.7 Value of Intercept of a regression line:

(a) Is never meaningful.

(b) Is sometimes meaningful.

(c) Is always meaningful.

(d) Is meaningful only when the value of the independent variable is zero.
Q.8 Interpolation in regression means:

(a) Predicting the value of the dependent variable for any value of the independent
variable.

(b) Predicting the value of the dependent variable for a given range of values of the
independent variable.
(c) Predicting the value of the dependent variable for such values of the independent
variable which are beyond the range.
(d) Predicting the value of the dependent variable for a range of values of the
independent variable which are obtained only from the data used to develop a
regression model.
Q.9 Which of the following statements is incorrect about Karl Pearson’s Coefficient (r) and
Spearman’s Rank Coefficient (R) of correlation?

(a) The values of R and r are always the same for a given data set
(b) Values of both lie between -1 and 1
(c) Both are interpreted similarly
(d) R is based on ranks, and r is based on values of dependent and independent
variables
Q.10 The simple linear regression model is based upon:
(a) More than two variables that can be plotted as a straight line.
(b) Two variables that can be plotted as a straight line.
(c) Any number of variables that can be plotted as a straight line.
(d) Any number of variables that can be plotted as a straight line but at least one
variable should be categorical.

12.10 CHECK YOUR PROGRESS — POSSIBLE ANSWERS


CHECK YOUR PROGRESS - I
Q.1 Regression Analysis
Q.2 (b) Simple Linear
Q.3 False
CHECK YOUR PROGRESS - II
Q.1

[Figure: Scatter plot between Annual Income and Premium Paid (horizontal axis: Annual Income)]

(a) Direct (positive), indicating that both variables move in the same direction.

(b) The relationship between them is moderate, as the data values are neither highly
scattered nor packed together.

CHECK YOUR PROGRESS - III
Q.1

(a) Karl Pearson’s Coefficient of Correlation (r) = 0.667522

(b) Spearman’s Rank Coefficient of Correlation (R) = 0.745455

Karl Pearson’s Coefficient of Correlation (r) = 0.667522 better reflects the
relationship between the variables, as its value is obtained from the actual values of X and
Y rather than from ranks. r = 0.667522 indicates a positive and
moderate relationship.
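As an illustration, the value of r can be checked with a few lines of Python. This is only a sketch: it assumes that the data are the ten (X, Y) pairs tabulated later in this unit under Check Your Progress - VII, and the names income, premium and pearson_r are chosen here for illustration.

```python
import math

# Assumed data: X in Rs lakh, Y (premium paid) in Rs thousand,
# taken from the residual table of Check Your Progress - VII.
income = [8, 6.5, 7, 5, 6, 13, 18, 10, 7.5, 9]
premium = [18, 15, 10, 6, 7, 15, 20, 12, 16, 11]


def pearson_r(x, y):
    """Karl Pearson's coefficient: S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(income, premium)
print(round(r, 6))
```

The sketch works from the actual values of X and Y, which is the property the answer above relies on.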
CHECK YOUR PROGRESS - IV

Q.1 Predictor variable is annual income and the response variable is the premium paid.

Q.2 False
CHECK YOUR PROGRESS -V

Q.8 Regression model: Ŷ = 5.912088 + 0.787546 X

Q.9 Slope value b1 = 0.787546 indicates that the premium amount changes by Rs 0.787546
thousand, or Rs 787.546, with every Rs 1 Lakh change in annual income. If
annual income increases (or decreases) by Rs 1 Lakh, the premium is estimated to
increase (or decrease) by Rs 787.546.

Intercept value b0 = 5.912088 indicates that when annual income is zero, the premium paid
would be Rs 5,912.088. But it is highly unlikely that a person who doesn’t earn will
purchase an insurance policy and pay the premium. Therefore, the value of the
intercept should be interpreted carefully.
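The slope and intercept can be reproduced with the least-squares formulas b1 = S_xy / S_xx and b0 = mean(Y) - b1 × mean(X). The following Python sketch assumes the ten data pairs tabulated under Check Your Progress - VII; the function name fit_line is hypothetical.

```python
# Assumed data: X in Rs lakh, Y (premium paid) in Rs thousand.
income = [8, 6.5, 7, 5, 6, 13, 18, 10, 7.5, 9]
premium = [18, 15, 10, 6, 7, 15, 20, 12, 16, 11]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


b0, b1 = fit_line(income, premium)
print(round(b0, 6), round(b1, 6))
```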
CHECK YOUR PROGRESS - VI

Q.1 Minimum Value = Rs 5 Lakhs; Maximum Value = Rs 18 Lakhs

The regression model shall make the best predictions of the premium amount for
annual sales amounts between Rs 5 Lakhs and Rs 18 Lakhs.

Q.2 False


CHECK YOUR PROGRESS - VII


Q.1
Annual Sales (X) | Premium Paid (Y) | Predicted Y | Residuals
8 | 18 | 12.21245421 | 5.78755
6.5 | 15 | 11.03113553 | 3.96886
7 | 10 | 11.42490842 | -1.4249
5 | 6 | 9.84981685 | -3.8498
6 | 7 | 10.63736264 | -3.6374
13 | 15 | 16.15018315 | -1.1502
18 | 20 | 20.08791209 | -0.0879
10 | 12 | 13.78754579 | -1.7875
7.5 | 16 | 11.81868132 | 4.18132
9 | 11 | 13 | -2

CHECK YOUR PROGRESS - VIII


Q.1 SST = 190; SSR = 84.66; SSE = 105.34

CHECK YOUR PROGRESS - IX


Q.1 Coefficient of determination r² = 0.445585 indicates that about 44.56% of the variation in the value
of the premium is explained by its relationship with annual income. The remaining 55.44%
of the variation is due to factors not considered in the model.

Q.2 Model Standard Deviation, S = 3.628685 thousand rupees, or Rs 3,628.685. It is
known as the standard error of the estimate.
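The residuals, the sums of squares, the coefficient of determination and the standard error of the estimate are linked by SST = SSR + SSE, r² = SSR / SST and S = sqrt(SSE / (n - 2)). A minimal Python sketch of this chain, assuming the ten data pairs of Check Your Progress - VII and the coefficients reported above (variable names are illustrative only):

```python
import math

# Assumed data: X in Rs lakh, Y (premium paid) in Rs thousand.
income = [8, 6.5, 7, 5, 6, 13, 18, 10, 7.5, 9]
premium = [18, 15, 10, 6, 7, 15, 20, 12, 16, 11]
b0, b1 = 5.912088, 0.787546  # intercept and slope reported in the answers

predicted = [b0 + b1 * x for x in income]
residuals = [y - y_hat for y, y_hat in zip(premium, predicted)]

mean_y = sum(premium) / len(premium)
sst = sum((y - mean_y) ** 2 for y in premium)    # total variation
sse = sum(e ** 2 for e in residuals)             # unexplained variation
ssr = sst - sse                                  # explained variation
r_squared = ssr / sst
std_error = math.sqrt(sse / (len(premium) - 2))  # n - 2 degrees of freedom

print(round(sst, 2), round(ssr, 2), round(sse, 2))
print(round(r_squared, 6), round(std_error, 6))
```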

12.11 ANSWERS TO SELF-ASSESSMENT QUESTIONS


Q.1 (c)
Q.2 (b)
Q.3 (d)
Q.4 (a)
Q.5 (c)
Q.6 (d)
Q.7 (b)
Q.8 (d)
Q.9 (a)
Q.10 (b)
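The answers to Q.3 and Q.4 follow directly from the model Y = 5.5 + 20.05 X: the estimate is obtained by substituting X, and the residual is the observed value minus the estimate. A short Python check (the function name predict is chosen here for illustration):

```python
def predict(x, intercept=5.5, slope=20.05):
    """Estimated Y from the regression model Y = 5.5 + 20.05 X."""
    return intercept + slope * x


estimate_q3 = predict(15)        # Q.3: estimated Y for X = 15
residual_q4 = 400 - predict(20)  # Q.4: observed Y minus estimated Y

print(round(estimate_q3, 2), round(residual_q4, 2))
```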


SIMULATION

STRUCTURE
13.0 Objectives
13.1 Introduction
13.2 Definition
13.3 Applications of Simulation
13.4 Advantages and Disadvantages
13.5 Types of Simulation
13.6 Steps in Simulation
13.7 Random Numbers
13.8 Monte Carlo Simulation
13.9 Applications of Simulation

13.0 OBJECTIVES
After reading this unit, you will be able to:
• understand what simulation is and how it aids in the analysis of a problem
• learn why simulation is a significant problem-solving tool
• understand the difference between static and dynamic simulation
• identify the important role probability distributions, random numbers, and the
computer play in implementing simulation models
• realize the relative advantages and disadvantages of simulation models.

13.1 INTRODUCTION

A simulation is a computerized model that replicates the operation of a real-world system,
providing a realistic and engaging experience to the learner. Before any important event, we
perform rehearsals so that the event runs smoothly; rehearsals help to locate the pitfalls and
rectify the problems before the “real thing”. Simulation models are important in the
aviation industry, where a proposed aircraft is tested for its aerodynamic properties before
the final model is made and pilots are trained on simulators; in spaceflight; and in disaster
management, where simulation techniques are used to create conditions similar to a natural
disaster (the well-known fire drill) so that the team is well trained and ready for
rescue operations in case of a disaster. Defence forces use simulation games to prepare
soldiers for an attack and to plan war strategy. Simulation also helps in quantifying
the relationships among complex variables that cannot be solved mathematically. To obtain
the best learning outcomes, simulation is also used in fields like physical sciences,
engineering, statistics, finance, etc. A simulation model is prepared using assumptions
about the operation of the system. Simulation models are also very helpful in developing the vital
skills required by an employee to perform a particular job productively.

13.2 DEFINITION

As per the Cambridge Dictionary, simulation is defined as “a model of a set of problems or
events that can be used to teach someone how to do something, or the process of modeling”.
As per Merriam-Webster, simulation is defined as “the imitative representation of the
functioning of one system or process by means of the functioning of another” or “examination
of a problem often not subject to direct experimentation by means of a simulating device.”

E.g. a computer simulation of space flight, a simulation of the planet’s surface.

13.3 APPLICATIONS OF SIMULATION

Due to its flexible methodology, simulation is applied in almost every field. One of the greatest
strengths of simulation is its ability to answer “what-if” questions. Below are some real
applications of simulation:

• Medical researchers use animals to simulate the effect of a new drug before it is
introduced to human beings.
• Firefighters conduct various drills to prepare their team for the time of need.
• All commercial pilots are trained in a simulator and exposed to extreme
weather conditions before they fly an aircraft.
• The automobile industry simulates accidents to test the safety of a car when it
meets with an accident.
• Setting the stock level to meet fluctuating demand at retail stores.
• Bidding for drilling projects.
When should we simulate?

Simulation is used when we are dealing with problems that are very complicated and
have no optimal solution, or when it is very risky or costly to experiment with real
situations, e.g. when a person enrolls for driving or flying lessons. To teach driving, the person is
first trained on a simulator in the basic operations of driving, as it is very risky to take an
inexperienced driver directly onto the road, where he may lose control and be involved in a serious
accident. For successful training, it is important that the simulation adequately imitates the real
conditions.
In general, simulation is used when:

• Uncertainty exists in the system (e.g. disaster management).
• Real experiments are expensive or sometimes not possible (e.g. training of
astronauts).
• The process is repetitive (e.g. production line).
• The best option has to be chosen out of multiple available options.
• It is not advisable to experiment with the real system (e.g. fighter pilot training).
• A system that deals with uncertainty has to be studied.
• The situation conforms to the principles of logical reasoning.

Hence, potential problems can be identified and solved with the help of simulation before
actual testing of the system.

13.4 ADVANTAGES OF SIMULATION

The main advantage of simulation is that, unlike in a deterministic model, the variables are not
fixed in advance: we can vary them randomly and ascertain how the system behaves and
what happens to key decision variables when the values are changed. In this way, we
determine the range of possible outcomes with their associated probabilities, and a
sensitivity analysis is carried out to identify the best possible outcome. Simulation is very
important when experimenting with the real situation is costly and risky.
Simulation has become a popular tool for decision-making in many areas for the
reasons listed below:
• It is easy and flexible.
• It helps us to visualize the implications of the assumptions used in making a model.
• It does not interfere with the real-world system.
• It is less disruptive than real experiments.
• Data can be easily generated.
• It saves cost.
• It enumerates the risk associated with different probabilities of events.

A simulation project is carried out in several phases: it starts by identifying the objectives,
designing a model, collecting data, specifying variables, etc., and finally arrives at the findings.
Disadvantages of Simulation

Along with its many advantages, simulation also has some disadvantages:
• It is based on a trial-and-error approach, which generates different outputs in
different runs.
• The simulation model does not provide any solution by itself; the user has to
specify the constraints for which the modeling is to be done.
• At times people don’t take it seriously, or suffer fatigue, if they are made to practice
on a simulation model, as they know that it is just a virtual exposure.
• Many people find the abstraction difficult to understand, as the solutions are based
on virtual modeling.

13.5 TYPES OF SIMULATION

Simulation models can be classified in different ways: as
discrete or continuous, fixed interval or next event, and deterministic or probabilistic.

Discrete Versus Continuous

• In a discrete system, the changes in the system state are discontinuous, i.e. the state
variable changes only at a countable number of points in time. The change in the
state of the system is called an event. E.g. the number of cars serviced, the
number of complaints received, and the arrival or departure of customers in a
queue.

• In a continuous system, the state changes smoothly with time, i.e. the variable of interest can
assume either integer or non-integer values, e.g. weight, height, volume, etc.

Thus, it is important to establish whether the variables are discrete or continuous before
designing the simulation.
Fixed Interval Versus Next Event Simulation

• In fixed-interval simulation, the computer is programmed to advance the simulation in fixed time
intervals, and the system checks whether the event has taken place in each
interval; all the events that take place during that interval are treated
as outcomes. E.g. the manager of a car showroom is interested in the number of
cars sold in a day rather than the points in time at which each car was sold;
likewise, the manufacturer at a textile mill is interested in the number of breaks
(defects) per square yard of cloth rather than when or where they occurred.

• Next-event simulation emphasizes when an event occurs: the computer is
programmed to generate the time to the occurrence of the event. For example, in
a production line, if the machine breaks down, the manager is interested in how long
the machine operated between breakdowns and how long it will take to repair the
machine.

So, if the interest lies in how many events occur in a given period, fixed-interval simulation
is used; if it lies in when an event occurs or how much time or effort it requires, next-event
simulation is used.
Deterministic Versus Probabilistic Simulation

• A deterministic model is one that, under the same initial conditions, always
gives the same results every time we run the model. Most mathematical
models are deterministic, e.g. speed = distance/time, so a distance of 150 km and a
time of 3 hours will always result in a speed of 50 km/hr. Similarly, the
binomial probability formula will always yield the same probabilities for fixed p and n.
• On the other hand, a probabilistic model incorporates an element of randomness.
Every time the model is run, it generates different results even when the initial
conditions are kept the same. Simulation is very helpful here.
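The contrast can be illustrated with a small Python sketch. The speed formula is the one quoted above; the demand model (an equally likely demand of 5 to 12 units) is purely hypothetical and is included only to show that a probabilistic model returns varying results for the same inputs.

```python
import random


def speed(distance_km, time_hr):
    """Deterministic model: the same inputs always give the same output."""
    return distance_km / time_hr


def simulated_demand(rng, low=5, high=12):
    """Probabilistic model (hypothetical): demand is drawn at random,
    so repeated runs with the same inputs can give different results."""
    return rng.randint(low, high)


rng = random.Random(2024)  # seeded generator, so this run is repeatable
speeds = [speed(150, 3) for _ in range(3)]
demands = [simulated_demand(rng) for _ in range(20)]

print(speeds)   # identical values in every run
print(demands)  # values vary from draw to draw
```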

13.6 STEPS IN SIMULATION

Simulation usually involves the following 7 steps:

(i) Define the problem and set objectives
(ii) Gather data
(iii) Develop the model
(iv) Validate the model
(v) Design the experiments
(vi) Perform simulation runs
(vii) Analyze and interpret the results

13.7 RANDOM NUMBERS

The procedure of sampling from probability distributions is known as Monte Carlo sampling.
It is based on the relative frequency (probability) distribution of the variable. The samples generated using this
procedure should be independent. As the probabilities are calculated to 2 decimal places
and add up to 1, we need 100 two-digit numbers to represent each point of probability;
the random numbers 00 to 99 are used for this purpose. Each random
number from 00 to 99 has an equal probability of showing up, and it is
independent of any number already drawn.
Excel has two functions for generating random numbers, RAND() and RANDBETWEEN().
Both draw from a uniform distribution.
RAND() returns a number from the continuous uniform distribution U(0,1). As the numbers come
from a formula, they are not truly random and are hence called pseudo-random.
RANDBETWEEN(low, high) generates a pseudo-random integer between low and high, where all
integers in that range are equally likely.
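For readers working outside Excel, the two functions have close counterparts in Python's standard random module. The sketch below is an illustration only; the wrapper names rand and randbetween are chosen here to mirror the Excel names and are not part of any library.

```python
import random


def rand():
    """Counterpart of RAND(): pseudo-random number, uniform on [0, 1)."""
    return random.random()


def randbetween(low, high):
    """Counterpart of RANDBETWEEN(low, high): integer, both ends included."""
    return random.randint(low, high)


random.seed(42)  # fixing the seed makes the pseudo-random sequence repeatable
two_digit_numbers = [f"{randbetween(0, 99):02d}" for _ in range(5)]
print(two_digit_numbers)
print(rand())
```

Re-seeding with the same value reproduces the same sequence, which is what "pseudo-random" means in practice.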

13.8 MONTE CARLO SIMULATION

The Monte Carlo method was invented by John von Neumann and Stanislaw Ulam in the
1940s and sought to solve complex problems using random and probabilistic methods. The
term Monte Carlo refers to the administrative area of Monaco where European elites
gamble.

There are different types of simulation, but we will focus only on probabilistic
simulations. When we work with a small set of data, the random behaviour of variables
can be reproduced by drawing cards, rolling dice, flipping a coin, spinning an arrow on a
clock face, using published tables of random numbers, etc. A specific numerical value is
assigned to each possible outcome of the experiment. Though this is a very
simple technique, it is very time-consuming and cannot meet practical requirements when
the experiment has a large number of outcomes, for example when decision-makers are
interested in the possibility of an accident in a laboratory due to a radiation leak, the
number of machine breakdowns in a production line, the demand for a product, etc. As
there could be a large number of outcomes in these situations, the above-mentioned manner
of randomization may not be feasible. In such a case it is convenient to use computer-generated
random variables. Monte Carlo Simulation is a form of computer simulation: a
mathematical technique that generates random variables for modeling the risk or uncertainty
of a system, using different probabilities to predict an outcome that is difficult to
observe in reality due to the random nature of the variable. The random variables are
generated from probability distributions such as the normal, exponential, etc. Monte Carlo
Simulation is used when the model has uncertain parameters or when the system is very
complex. The roots of simulation lie before World War II, when pilots and infantry soldiers were trained
with simulators and mock-ups to prepare for battle. Beyond warfare,
simulation has spread to almost all domains, such as finance, engineering, supply
chain, physical science, computational biology, statistics, and artificial intelligence.

Nowadays, even though there is no dearth of information, it is still very difficult to predict the
future with accuracy. In such situations, Monte Carlo Simulation comes to our rescue, as we
can visualize the outcomes of a decision, which helps in making better decisions
under uncertainty. As Monte Carlo Simulation uses a probability distribution function for
modeling random variables, different probability distributions generate different outcomes.
In this way, decision-makers obtain a feel for not just what to expect but also the probability of
occurrence of each particular outcome. In this manner, it is possible to model the association
between random variables.

Example 13.1 To illustrate, consider a bakery that maintains a record of the sales of
multigrain bread for 2 months (i.e. 60 days).

Demand | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Total
No. of days | 3 | 11 | 7 | 6 | 10 | 12 | 9 | 2 | 60

Based on the above data, we can estimate the probability distribution of demand by
converting the frequencies into probabilities (number of days divided by 60, rounded to 2
decimal places). The above data can then be represented as the
probability distribution table below:

Demand | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Total
Probability | 0.05 | 0.18 | 0.12 | 0.10 | 0.17 | 0.20 | 0.15 | 0.03 | 1.00
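The conversion from frequencies to probabilities is a single division per demand level. A short Python sketch using the 60-day record above (the dictionary name days_observed is illustrative):

```python
# Number of days (out of 60) on which each demand level was observed
days_observed = {5: 3, 6: 11, 7: 7, 8: 6, 9: 10, 10: 12, 11: 9, 12: 2}

total_days = sum(days_observed.values())

# Relative frequency of each demand level, rounded to 2 decimal places
probabilities = {
    demand: round(days / total_days, 2)
    for demand, days in days_observed.items()
}

for demand, p in probabilities.items():
    print(demand, p)
```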
Hence, from the data, we can say that there is a 5% probability that 5 loaves of multigrain
bread would be demanded on a day, an 18% probability that 6 loaves of multigrain bread
would be demanded, and so on. In this way, the above table serves as the model for the simulation
under consideration. We can simulate the model and try to capture the randomness
associated with the demand for loaves of multigrain bread. There are various ways in which
random numbers may be generated, but with computers it is usually very easy to generate
random numbers.
Steps involved in the Monte Carlo Approach

The steps involved in a simulation depend on the model applied. At times it can be
very complex if there are too many factors involved. But in general, it has 4 steps.
Step 1. Identify the model

A simulation is built around a quantitative model of the business plan or process, which is
defined by a series of formulas using mathematical operations that represent the
characteristics and other features of the system. Simulation can be applied to anything from a simple
model for determining profit to a model involving complex engineering formulas,
statistical models, financial models, etc. The model is then simulated to understand
how the system will behave in particular scenarios. Simulation is also applied in forecasting
the outcomes of a situation.
Step 2. Define the Input Parameters

After defining the model, it is also vital to express the equation for each factor and define the
distribution of the parameters. The equation may work on multiple
distributions: some parameters may follow a normal distribution while others follow
a uniform distribution. In such a case, we need to specify the parameters of each distribution,
e.g. the mean and standard deviation for parameters that follow a normal distribution.

Fixing the input parameters is as vital as building the quantitative model. Without precise
input, a model can never generate precise output, i.e. the desired outcomes. Compared
to the deterministic approach, a stochastic approach will provide a more reliable conclusion.
While creating a simulation in Excel, you can use either of the two functions mentioned
before (RAND and RANDBETWEEN) to generate random numbers.

Step 3. Create Random Data


To capture randomness, a very large data set is required for every parameter. Excel can
easily generate random data that follow a specific distribution for a given parameter.
To generate random numbers from a specific probability distribution, Excel has
statistical functions for probability distributions. These functions can generate random input
values when combined with the RAND function. Below are some of the commonly used
functions.

Normal: NORM.DIST, NORM.INV; Standard normal: NORM.S.DIST, NORM.S.INV;
t-distribution: T.DIST, T.INV; F-distribution: F.DIST, F.INV; Chi-square: CHISQ.DIST, CHISQ.INV;
Lognormal: LOGNORM.DIST, LOGNORM.INV; Binomial: BINOM.DIST, BINOM.INV; Hypergeometric: HYPGEOM.DIST;
Beta: BETA.DIST, BETA.INV; Gamma: GAMMA.DIST, GAMMA.INV; Exponential: EXPON.DIST; Weibull: WEIBULL.DIST;
Poisson: POISSON.DIST; Negative binomial: NEGBINOM.DIST

Unless a parameter is known to follow a specific distribution, the default distribution
used is the normal. The syntax is NORM.INV(probability, mean, standard deviation).
To randomize the results, we use the RAND function as the probability argument. NORM.INV
then returns the value at that random percentile of a normal variable with the given mean and
standard deviation. For example, NORM.INV(RAND(), 150, 25) returns a random value from
a normal distribution with mean = 150 and standard deviation = 25.
The mean and standard deviation values should be consistent with the expected collection
of input values. For example, if you are trying to forecast next year's profits, the previous
year's sales amounts can be used as sample data. Excel has built-in functions to calculate
the mean and standard deviation.
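The idea behind NORM.INV(RAND(), 150, 25) is inverse-transform sampling: a uniform random probability is fed into the inverse of the normal cumulative distribution. A minimal Python sketch of the same idea uses statistics.NormalDist from the standard library; the helper names norm_inv and random_normal are hypothetical, and the sample size of 10,000 is chosen only for illustration.

```python
import random
from statistics import NormalDist, mean, stdev


def norm_inv(probability, mu, sigma):
    """Value below which the given probability lies (like NORM.INV)."""
    return NormalDist(mu, sigma).inv_cdf(probability)


def random_normal(mu, sigma, rng):
    """Like NORM.INV(RAND(), mu, sigma): uniform draw -> normal value."""
    p = rng.random()
    while p == 0.0:  # inv_cdf requires 0 < p < 1
        p = rng.random()
    return norm_inv(p, mu, sigma)


rng = random.Random(7)  # seeded so that the run is repeatable
sample = [random_normal(150, 25, rng) for _ in range(10_000)]

print(round(mean(sample), 1), round(stdev(sample), 1))
```

With a large sample, the sample mean and standard deviation should lie close to 150 and 25.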

Step 4. Simulate and Analyze Process Output

Using the simulated data, Excel can easily calculate the outcome of the model. Most of the
variation in the parameters is captured, as the model is evaluated using a large set of
random data.
Real-life applications of Simulation

Let us consider the same bakery problem and apply the steps of Monte Carlo Simulation in
Excel.
Step 1
As the probabilities have been calculated to 2 decimal places and add up to 1, we need
100 two-digit numbers to represent each point of probability. The random numbers
00 to 99 are used for this purpose. In this example, as the probability of 5 loaves of bread
is 0.05, we assign to it the 5 random numbers from 00 to 04. Similarly, each demand
level is assigned an interval of random numbers whose length is proportional to its probability.
The cumulative probability is calculated to fix the upper end of the interval for each event.
Similarly, if probabilities are calculated to 3 decimals, then 1000 random numbers are
assigned, from 000 to 999, and so on.
Demand | Probability | Cumulative Probability | Random Number Interval
5 | 0.05 | 0.05 | 00-04
6 | 0.18 | 0.23 | 05-22
7 | 0.12 | 0.35 | 23-34
8 | 0.10 | 0.45 | 35-44
9 | 0.17 | 0.62 | 45-61
10 | 0.20 | 0.82 | 62-81
11 | 0.15 | 0.97 | 82-96
12 | 0.03 | 1.00 | 97-99
Total | 1.00
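The random number coding can be generated mechanically: each demand level receives as many consecutive two-digit numbers as 100 times its probability. The Python sketch below is one way to do this; the function name random_number_intervals is hypothetical.

```python
def random_number_intervals(probabilities, digits=2):
    """Assign each outcome a block of consecutive random numbers whose
    size is proportional to its probability (100 numbers for 2 digits)."""
    scale = 10 ** digits
    intervals = {}
    start = 0
    for outcome, p in probabilities.items():
        count = round(p * scale)  # how many numbers this outcome gets
        intervals[outcome] = (start, start + count - 1)
        start += count
    if start != scale:
        raise ValueError("probabilities must add up to 1")
    return intervals


demand_probabilities = {5: 0.05, 6: 0.18, 7: 0.12, 8: 0.10,
                        9: 0.17, 10: 0.20, 11: 0.15, 12: 0.03}
intervals = random_number_intervals(demand_probabilities)

for demand, (low, high) in intervals.items():
    print(demand, f"{low:02d}-{high:02d}")
```

Each interval contains exactly 100 × probability numbers, so a demand of 6 (probability 0.18) receives 18 numbers.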
Step 2

After determining the random number intervals, we use Excel to generate the random
numbers using the function RANDBETWEEN(0, 99); drag this formula down the cells to generate
as many random numbers as are required for the simulation. Using this formula, we generate
demand for 15 days. The numbers generated are: 12, 40, 22, 1, 28, 61, 19, 94, 87, 38, 29, 46,
73, 1, 16. Now 12 lies in the interval corresponding to a demand of 6 loaves of multigrain
bread; in the same way, the entire table is completed.
In other words, the count of numbers assigned to each occurrence is directly proportional to its
probability.

Day | R. No. | Demand
1 | 12 | 6
2 | 40 | 8
3 | 22 | 6
4 | 1 | 5
5 | 28 | 7
6 | 61 | 9
7 | 19 | 6
8 | 94 | 11
9 | 87 | 11
10 | 38 | 8
11 | 29 | 7
12 | 46 | 9
13 | 73 | 10
14 | 1 | 5
15 | 16 | 6

The mean simulated demand is 7.6 loaves per day.

Hence, with the help of simulation, the baker can form an idea of how many loaves of
multigrain bread he should bake to satisfy the demand.
However, if the random numbers are not truly random, or if demand does not follow the
assumed probability distribution, the simulation will lead to
erroneous results.
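The 15-day table can be reproduced by looking up each random number in the coding intervals and averaging the resulting demands. The Python sketch below assumes each demand level owns exactly 100 × probability numbers (so 00-04 for a demand of 5, 05-22 for 6, and so on); the function name demand_for is illustrative.

```python
from bisect import bisect_left

# Upper end of the random number interval for each demand level
upper_bounds = [4, 22, 34, 44, 61, 81, 96, 99]
demand_levels = [5, 6, 7, 8, 9, 10, 11, 12]


def demand_for(random_number):
    """Return the demand whose interval contains the random number."""
    return demand_levels[bisect_left(upper_bounds, random_number)]


random_numbers = [12, 40, 22, 1, 28, 61, 19, 94, 87, 38, 29, 46, 73, 1, 16]
demands = [demand_for(r) for r in random_numbers]
mean_demand = sum(demands) / len(demands)

print(demands)
print(round(mean_demand, 1))
```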

13.9 APPLICATIONS OF SIMULATION

Simulation of an inventory system


In inventory management, variations are observed in both the demand and the lead time; in
such a situation, simulation can help in forecasting the variables.

Example 13.2
A confectionery shop owner is interested to know how, with specific re-order levels and
re-order quantities, he can optimize the total inventory cost. The details of the probability
distributions and the various costs are shared below:

Unit Demand | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
Probability | 0.05 | 0.11 | 0.12 | 0.08 | 0.18 | 0.13 | 0.09 | 0.10 | 0.11 | 0.03

The probability distribution of lead time

Lead time (days) | 1 | 2 | 3 | 4
Probability | 0.35 | 0.30 | 0.20 | 0.15

The ordering cost is known to be Rs 80 per order, the holding cost is Rs 5 per unit per day, while the unit
shortage cost, i.e. loss in profits, is Rs 20 per unit per day. Evaluate a simulation plan for 2 months for
a re-order quantity of 50 units and a re-order level of 20 units, with an opening inventory balance of 50 units.
Solution:

The first step is to set up a coding system that assigns demand values to the random numbers. As
the probabilities are calculated to 2 decimals, the random numbers generated are between
00 and 99.

The random number coding for both distributions is shown in the tables below.

Units Demanded | Probability | Cumulative Probability | Random Number Interval
3 | 0.05 | 0.05 | 00-04
4 | 0.11 | 0.16 | 05-15
5 | 0.12 | 0.28 | 16-27
6 | 0.08 | 0.36 | 28-35
7 | 0.18 | 0.54 | 36-53
8 | 0.13 | 0.67 | 54-66
9 | 0.09 | 0.76 | 67-75
10 | 0.10 | 0.86 | 76-85
11 | 0.11 | 0.97 | 86-96
12 | 0.03 | 1.00 | 97-99

Lead Time (days) | Probability | Cumulative Probability | Random Number Interval
1 | 0.35 | 0.35 | 00-34
2 | 0.30 | 0.65 | 35-64
3 | 0.20 | 0.85 | 65-84
4 | 0.15 | 1.00 | 85-99

Let’s solve this problem using Excel. Using the function RANDBETWEEN(),
random numbers are generated for demand. For instance, the random number 58 for day 1
lies in the interval 54-66, corresponding to a demand of 8 units. With the initial balance of
50 units, shown in column (7) of the table, and a demand of 8 units on the first day, the
number of units in stock would be 42 for this day, involving a holding cost of Rs 210 (42 x
5), as shown in column (9). On day 2, the simulated demand is 3 units, leaving a balance of
39 units with a corresponding holding cost of Rs 195, and so on.

On day 5, the demand generated is 9 units, leaving a balance of 13 units. As
the balance has fallen below 20 units, an order for 50 units is placed, with the corresponding
ordering cost of Rs 80 shown in column (8). At this point a random number is generated for
the lead time using the function RANDBETWEEN(); it comes out as 7, which lies in the interval
00-34, corresponding to a lead time of 1 day. Hence, the stock of 50 units would be received the
next day, i.e. on day 6, as entered in column (6). We proceed in the same way and tabulate the
results below.
On day 47, the simulated demand is 6 units and the opening balance is 4 units. In this situation, we can
satisfy a demand of only 4 units and are left with a balance of 0 units. The 2 units of
unmet demand contribute a shortage cost of Rs 40 (2 x 20), as shown in column (10).
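Day | R.No. | Demand | R.No. | Lead Time | Receipts | Balance | Ordering Cost | Holding Cost | Cost of Shortage
(1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) | (10)
0 | - | - | - | - | - | 50 | - | - | -
1 | 58 | 8 | - | - | - | 42 | - | 210 | -
2 | 3 | 3 | - | - | - | 39 | - | 195 | -
3 | 81 | 10 | - | - | - | 29 | - | 145 | -
4 | 42 | 7 | - | - | - | 22 | - | 110 | -
5 | 73 | 9 | 7 | 1 | - | 13 | 80 | 65 | -
6 | 63 | 8 | - | - | 50 | 55 | - | 275 | -
7 | 31 | 6 | - | - | - | 49 | - | 245 | -
8 | 26 | 5 | - | - | - | 44 | - | 220 | -
9 | 56 | 8 | - | - | - | 36 | - | 180 | -
10 | 35 | 6 | - | - | - | 30 | - | 150 | -
11 | 84 | 10 | - | - | - | 20 | - | 100 | -
12 | 30 | 6 | 30 | 1 | - | 14 | 80 | 70 | -
13 | 16 | 5 | - | - | 50 | 59 | - | 295 | -
14 | 47 | 7 | - | - | - | 52 | - | 260 | -
15 | 64 | 8 | - | - | - | 44 | - | 220 | -
16 | 68 | 9 | - | - | - | 35 | - | 175 | -
17 | 60 | 8 | - | - | - | 27 | - | 135 | -
18 | 87 | 11 | 53 | 2 | - | 16 | 80 | 80 | -
19 | 6 | 4 | - | - | - | 12 | - | 60 | -
20 | 54 | 8 | - | - | 50 | 54 | - | 270 | -
21 | 88 | 11 | - | - | - | 43 | - | 215 | -
22 | 48 | 7 | - | - | - | 36 | - | 180 | -
23 | 57 | 8 | - | - | - | 28 | - | 140 | -
24 | 54 | 8 | - | - | - | 20 | - | 100 | -
25 | 82 | 10 | 45 | 2 | - | 10 | 80 | 50 | -
26 | 6 | 4 | - | - | - | 6 | - | 30 | -
27 | 12 | 4 | - | - | 50 | 52 | - | 260 | -
28 | 50 | 7 | - | - | - | 45 | - | 225 | -
29 | 2 | 3 | - | - | - | 42 | - | 210 | -
30 | 72 | 9 | - | - | - | 33 | - | 165 | -
31 | 37 | 7 | - | - | - | 26 | - | 130 | -
32 | 39 | 7 | 25 | 1 | - | 19 | 80 | 95 | -
33 | 27 | 5 | - | - | 50 | 64 | - | 320 | -
34 | 10 | 4 | - | - | - | 60 | - | 300 | -
35 | 98 | 12 | - | - | - | 48 | - | 240 | -
36 | 82 | 10 | - | - | - | 38 | - | 190 | -
37 | 14 | 4 | - | - | - | 34 | - | 170 | -
38 | 57 | 8 | - | - | - | 26 | - | 130 | -
39 | 54 | 8 | 73 | 3 | - | 18 | 80 | 90 | -
40 | 54 | 8 | - | - | - | 10 | - | 50 | -
41 | 42 | 7 | - | - | - | 3 | - | 15 | -
42 | 80 | 10 | - | - | 50 | 43 | - | 215 | -
43 | 40 | 7 | - | - | - | 36 | - | 180 | -
44 | 94 | 11 | - | - | - | 25 | - | 125 | -
45 | 73 | 9 | 66 | 3 | - | 16 | 80 | 80 | -
46 | 98 | 12 | - | - | - | 4 | - | 20 | -
47 | 30 | 6 | - | - | - | 0 | - | 0 | 40
48 | 68 | 9 | - | - | 50 | 41 | - | 205 | -
49 | 37 | 7 | - | - | - | 34 | - | 170 | -
50 | 34 | 6 | - | - | - | 28 | - | 140 | -
51 | 16 | 5 | - | - | - | 23 | - | 115 | -
52 | 12 | 4 | 2 | 1 | - | 19 | 80 | 95 | -
53 | 83 | 10 | - | - | 50 | 59 | - | 295 | -
54 | 25 | 5 | - | - | - | 54 | - | 270 | -
55 | 80 | 10 | - | - | - | 44 | - | 220 | -
56 | 18 | 5 | - | - | - | 39 | - | 195 | -
57 | 34 | 6 | - | - | - | 33 | - | 165 | -
58 | 78 | 10 | - | - | - | 23 | - | 115 | -
59 | 92 | 11 | 28 | 1 | - | 12 | 80 | 60 | -
60 | 63 | 8 | - | - | 50 | 54 | - | 270 | -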

After completing 60 days, we find that the total ordering cost = Rs 720, the holding cost = Rs 9,700, and
the out-of-stock cost = Rs 40, adding up to Rs 10,460.
Repeating the same operations with only the re-order level changed to 15 units (not shown here; the
student is expected to compute this independently) and the remaining variables held constant, we
obtain a total cost of Rs 9,675, comprising total ordering cost = Rs 720, holding cost =
Rs 8,735, and out-of-stock cost = Rs 220.

Similarly, for a re-order quantity of 30 units, keeping the other variables
constant, we obtain a total cost of Rs 7,700, comprising total ordering cost = Rs 1,120,
holding cost = Rs 6,460, and out-of-stock cost = Rs 120.
We can thus optimize the total cost by changing the parameters.
For reference, we share the formulas used in Excel to solve the above problem.
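The bookkeeping of the 60-day table can also be sketched in Python. The sketch below is an illustration under stated assumptions, not the worksheet itself: stock on order is received at the start of the day it falls due, an order is placed only when the closing balance is below the re-order level and no order is outstanding, and the demand and lead-time sequences are those obtained from the random numbers in the table. The function name simulate_inventory is hypothetical.

```python
def simulate_inventory(demands, lead_times, opening=50, reorder_level=20,
                       order_qty=50, order_cost=80, hold_cost=5, short_cost=20):
    """Return (ordering, holding, shortage) costs for a daily simulation."""
    balance = opening
    due_day = None               # day on which the outstanding order arrives
    next_lead = iter(lead_times)
    ordering = holding = shortage = 0
    for day, demand in enumerate(demands, start=1):
        if due_day == day:       # receive the outstanding order
            balance += order_qty
            due_day = None
        unmet = max(demand - balance, 0)
        balance = max(balance - demand, 0)
        shortage += unmet * short_cost
        holding += balance * hold_cost
        if balance < reorder_level and due_day is None:
            due_day = day + next(next_lead)
            ordering += order_cost
    return ordering, holding, shortage


# Simulated daily demand for days 1 to 60, read from column (3) of the table
demands = [8, 3, 10, 7, 9, 8, 6, 5, 8, 6, 10, 6, 5, 7, 8,
           9, 8, 11, 4, 8, 11, 7, 8, 8, 10, 4, 4, 7, 3, 9,
           7, 7, 5, 4, 12, 10, 4, 8, 8, 8, 7, 10, 7, 11, 9,
           12, 6, 9, 7, 6, 5, 4, 10, 5, 10, 5, 6, 10, 11, 8]
# Lead times for the nine orders (random numbers 7, 30, 53, 45, 25, 73, 66, 2, 28)
lead_times = [1, 1, 2, 2, 1, 3, 3, 1, 1]

ordering, holding, shortage = simulate_inventory(demands, lead_times)
print(ordering, holding, shortage, ordering + holding + shortage)
```

Other policies can be explored by changing reorder_level or order_qty, provided fresh lead times are supplied for every order placed.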


Simulation of Queuing System


Queuing systems are based on the assumption that arrivals follow the Poisson distribution
and service times follow an exponential distribution. Apart from the distributions, the
following assumptions need to be kept in mind before simulating a queuing system.

Assumptions: the arrival population is infinite, the waiting capacity is unlimited, customers are served in
the order of their arrival (FCFS), arrivals are random, and service times are random.

Example 13.3
In a large bank, the manager is concerned about the waiting time of customers. He is in a
dilemma about whether to hire more staff to raise the level of service, as this will also lead
to an increase in the idle time of staff. He wants to determine how many staff should be hired to
minimize the total cost involved. He has shared data on the times between successive
arrivals and on service times for the past 200 observations. He requires help in optimizing the
cost.
Distribution of inter-arrival time

Time (Minutes) | Frequency
3 | 12
6 | 18
9 | 50
12 | 74
15 | 32
18 | 14
Total | 200


Distribution of service time

Time (Minutes) | Frequency
4 | 8
6 | 20
8 | 36
10 | 88
12 | 48
Total | 200

Solution:

Just like the previous example, the first step is to estimate the probabilities and cumulative
probabilities. Then, based on the cumulative probabilities, random numbers are assigned to each
observed inter-arrival time. Similarly, random numbers are assigned to the service times.
Random Number Coding for Inter-arrival Times

Time (Minutes) | Frequency | Probability | Cumulative Probability | Random Number Interval
3 | 12 | 0.06 | 0.06 | 00-05
6 | 18 | 0.09 | 0.15 | 06-14
9 | 50 | 0.25 | 0.40 | 15-39
12 | 74 | 0.37 | 0.77 | 40-76
15 | 32 | 0.16 | 0.93 | 77-92
18 | 14 | 0.07 | 1.00 | 93-99
Total | 200 | 1.00

Random Number Coding for Service Times

Time (Minutes) | Frequency | Probability | Cumulative Probability | Random Number Interval
4 | 8 | 0.04 | 0.04 | 00-03
6 | 20 | 0.10 | 0.14 | 04-13
8 | 36 | 0.18 | 0.32 | 14-31
10 | 88 | 0.44 | 0.76 | 32-75
12 | 48 | 0.24 | 1.00 | 76-99
Total | 200

Now we are ready to simulate the operation. To determine the arrival times, we start with 5
major columns and for service time we start with 4 major columns. We shall simulate the
bank problem for 30 days.

Now, the first random number for arrival is 48, which lies in the interval 40-76, corresponding
to 12 minutes. Assuming that the bank opens at 9:00 AM, the first arrival at the bank takes
place at 9:12 AM. Further, the random number for service time is 40, which lies in the interval
32-75, corresponding to 10 minutes. Thus, the service starts at 9:12 and ends at 9:22.
There would be no waiting time for this customer. The second random number for arrival is
49, which lies in the interval 40-76, corresponding to 12 minutes, so the second arrival would
be at 9:24 AM. The second random number for service time is 57, which lies in the interval 32-75,
corresponding to 10 minutes. As the first customer would leave the bank by 9:22, the waiting
time for the second customer is also 0. As there is no customer before him, he is the only
person in the queue, so the queue length is 1. We complete the table in the same
way, generating random numbers for Day 1.


Arrival Number (1) | Arrival R. No. (2) | Inter-arrival Time, mins (3) | Arrival O'Clock (4) | Service R. No. (5) | Service Time, mins (6) | Waiting Time, mins (7) | Service Begins (8) | Service Ends (9) | Queue Length (10)
1 | 48 | 12 | 9:12 AM | 40 | 10 | 0 | 09:12 | 09:22 | 1
2 | 49 | 12 | 9:24 AM | 57 | 10 | 0 | 09:24 | 09:34 | 1
3 | 85 | 15 | 9:39 AM | 19 | 8 | 0 | 09:39 | 09:47 | 1
4 | 45 | 12 | 9:51 AM | 48 | 10 | 0 | 09:51 | 10:01 | 1
5 | 44 | 12 | 10:03 AM | 81 | 12 | 0 | 10:03 | 10:15 | 1
6 | 76 | 12 | 10:15 AM | 30 | 8 | 0 | 10:15 | 10:23 | 2
7 | 0 | 3 | 10:18 AM | 91 | 12 | 5 | 10:23 | 10:35 | 2
8 | 91 | 15 | 10:33 AM | 36 | 10 | 2 | 10:35 | 10:45 | 2
9 | 39 | 9 | 10:42 AM | 98 | 12 | 3 | 10:45 | 10:57 | 2
10 | 54 | 12 | 10:54 AM | 89 | 12 | 3 | 10:57 | 11:09 | 2
11 | 36 | 9 | 11:03 AM | 2 | 4 | 6 | 11:09 | 11:13 | 1
12 | 90 | 15 | 11:18 AM | 46 | 10 | 0 | 11:18 | 11:28 | 1
13 | 80 | 15 | 11:33 AM | 66 | 10 | 0 | 11:33 | 11:43 | 1
14 | 41 | 12 | 11:45 AM | 17 | 8 | 0 | 11:45 | 11:53 | 1
15 | 93 | 18 | 12:03 PM | 52 | 10 | 0 | 12:03 | 12:13 | 1
16 | 53 | 12 | 12:15 PM | 2 | 4 | 0 | 12:15 | 12:19 | 1
17 | 67 | 12 | 12:27 PM | 6 | 6 | 0 | 12:27 | 12:33 | 1
18 | 49 | 12 | 12:39 PM | 86 | 12 | 0 | 12:39 | 12:51 | 2
19 | 24 | 9 | 12:48 PM | 77 | 12 | 3 | 12:51 | 01:03 | 2
20 | 38 | 9 | 12:57 PM | 85 | 12 | 6 | 01:03 | 01:15 | 2
21 | 46 | 12 | 1:09 PM | 25 | 8 | 6 | 01:15 | 01:23 | 2
22 | 58 | 12 | 1:21 PM | 58 | 10 | 2 | 01:23 | 01:33 | 2
23 | 16 | 9 | 1:30 PM | 94 | 12 | 3 | 01:33 | 01:45 | 2
24 | 5 | 3 | 1:33 PM | 38 | 12 | 12 | 01:45 | 01:57 | 2
25 | 65 | 12 | 1:45 PM | 11 | 6 | 12 | 01:57 | 02:03 | 2
26 | 54 | 12 | 1:57 PM | 8 | 6 | 6 | 02:03 | 02:09 | 2
27 | 25 | 9 | 2:06 PM | 52 | 10 | 3 | 02:09 | 02:19 | 2
28 | 38 | 9 | 2:15 PM | 83 | 12 | 4 | 02:19 | 02:31 | 2
29 | 79 | 15 | 2:30 PM | 22 | 8 | 1 | 02:31 | 02:39 | 1
30 | 91 | 15 | 2:45 PM | 65 | 10 | 0 | 02:45 | 02:55 | 1
Total | | | | | | 77 | | | 46

The average waiting time is 77 ÷ 30 = 2.57 minutes and the average queue length is 46 ÷ 30 =
1.53. By running this simulation multiple times, the manager can estimate the
average waiting time and average queue length in his bank. Using this information, he can
examine the alternatives, such as adding more counters in the bank, determine the
outcome under each new service pattern, and opt for the best alternative.
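The bookkeeping behind the waiting-time column can be sketched for a single counter as follows. This is an illustrative sketch only: it replays the first ten customers, taking the inter-arrival and service times from columns (3) and (6) of the table, and measures time in minutes after 9:00 AM. The variable names are assumptions made for the example.

```python
# Single-server queue for the first ten customers of the table.
# Times are minutes after 9:00 AM; names are illustrative.
inter_arrival = [12, 12, 15, 12, 12, 12, 3, 15, 9, 12]   # column (3)
service_time = [10, 10, 8, 10, 12, 8, 12, 10, 12, 12]    # column (6)

clock = 0            # arrival time of the current customer
server_free_at = 0   # time at which the counter next becomes free
waits = []
for gap, service in zip(inter_arrival, service_time):
    clock += gap
    start = max(clock, server_free_at)   # wait only if the counter is busy
    waits.append(start - clock)
    server_free_at = start + service

print(waits)
print(sum(waits) / len(waits))
```

The same loop extends to all 30 customers, and repeating it with fresh random numbers gives the repeated runs that the manager needs before comparing staffing alternatives.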

Some more examples:-


Example 13.4:

A company manufactures 32 units per day. The sale of these items depends upon demand
which has the following distribution.


Sales | Probability
30 | 0.35
31 | 0.15
32 | 0.05
33 | 0.10
34 | 0.15
35 | 0.20

The production cost and sale price of each unit are Rs 60 and Rs 80 respectively. Any unsold
product is to be disposed of at Rs 40 per unit. There is a penalty of Rs 5 per unit if the demand
is not met. Using the following random numbers, estimate the total profit/loss for the
company for the next 10 days: 1, 65, 50, 99, 48, 88, 77, 63, 13, 38. Will it be advantageous to
produce 30 units per day?

Solution:

Sales | Probability | Cumulative Probability | Random Number Interval
30 | 0.35 | 0.35 | 00-34
31 | 0.15 | 0.50 | 35-49
32 | 0.05 | 0.55 | 50-54
33 | 0.10 | 0.65 | 55-64
34 | 0.15 | 0.80 | 65-79
35 | 0.20 | 1.00 | 80-99

Profit= Rs 80 - Rs 60 = Rs 20/unit

Loss = Rs 40/unit
Penalty on stock-out = Rs 5/unit


Day | R. No. | Sales (demand) | Profit/Loss with 32 units produced | Profit/Loss with 30 units produced
1 | 1 | 30 | 30×20 - 2×40 = 520 | 30×20 = 600
2 | 65 | 34 | 32×20 - 2×5 = 630 | 30×20 - 4×5 = 580
3 | 50 | 32 | 32×20 = 640 | 30×20 - 2×5 = 590
4 | 99 | 35 | 32×20 - 3×5 = 625 | 30×20 - 5×5 = 575
5 | 48 | 31 | 31×20 - 1×40 = 580 | 30×20 - 1×5 = 595
6 | 88 | 35 | 32×20 - 3×5 = 625 | 30×20 - 5×5 = 575
7 | 77 | 34 | 32×20 - 2×5 = 630 | 30×20 - 4×5 = 580
8 | 63 | 33 | 32×20 - 1×5 = 635 | 30×20 - 3×5 = 585
9 | 13 | 30 | 30×20 - 2×40 = 520 | 30×20 = 600
10 | 38 | 31 | 31×20 - 1×40 = 580 | 30×20 - 1×5 = 595
Total profit | | | 5,985 | 5,875

On days 5 and 10, demand (31) is below the production of 32 units, so one unit remains unsold and
the loss of Rs 40 applies rather than the stock-out penalty. The total profit for 10 days is
Rs 5,985 when 32 units are produced, and if the company produces 30 units, then the total profit
is Rs 5,875. Hence, the company should continue to produce 32 units.
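The day-by-day arithmetic can be checked with a short script. The sketch below assumes the rules stated in the solution (Rs 20 profit per unit sold, a loss of Rs 40 per unsold unit, and a penalty of Rs 5 per unit of unmet demand) and applies them uniformly to every day; the function names `demand_for` and `day_profit` are illustrative and not part of the text.

```python
# Daily profit under the rules stated in the solution (names are illustrative).
UNIT_PROFIT = 20       # Rs 80 sale price - Rs 60 production cost
UNSOLD_LOSS = 40       # loss charged per unsold unit, as in the solution
STOCKOUT_PENALTY = 5   # penalty per unit of unmet demand


def demand_for(random_number):
    """Map a two-digit random number to demand using the coded intervals."""
    upper_bounds = [(34, 30), (49, 31), (54, 32), (64, 33), (79, 34), (99, 35)]
    for upper, demand in upper_bounds:
        if random_number <= upper:
            return demand
    raise ValueError("random number must lie between 0 and 99")


def day_profit(production, demand):
    sold = min(production, demand)
    unsold = production - sold   # positive only when demand < production
    unmet = demand - sold        # positive only when demand > production
    return sold * UNIT_PROFIT - unsold * UNSOLD_LOSS - unmet * STOCKOUT_PENALTY


# Random numbers as they appear in the R. No. column of the solution table.
random_numbers = [1, 65, 50, 99, 48, 88, 77, 63, 13, 38]
demands = [demand_for(r) for r in random_numbers]
for production in (32, 30):
    total = sum(day_profit(production, d) for d in demands)
    print(production, total)
```

Keeping the three rates as named constants makes it easy to test other production levels or a different disposal value.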

Example 13.5:

A toy factory produces robots that pass through two assembly lines to become the final product.
The processing time on each assembly line is regarded as a random variable and is
described by the following distributions.

Process Time (mins) | Assembly 1 | Assembly 2
5 | 0.10 | 0.35
6 | 0.15 | 0.30
7 | 0.20 | 0.20
8 | 0.25 | 0.10
9 | 0.18 | 0.03
10 | 0.12 | 0.02


Using the following random numbers, find the expected process time for the period.
R. No. for assembly 1: 34, 43, 2, 5, 28, 76, 33, 45, 89, 24, 43, 15, 90, 80, 9

R. No. for assembly 2: 21, 83, 36, 75, 74, 11, 94, 34, 19, 8, 91, 44, 12, 65, 54
Solution:

Process Time (mins) | Assembly 1 Prob. | Cum. Prob. | R.N. Interval | Assembly 2 Prob. | Cum. Prob. | R.N. Interval
5 | 0.10 | 0.10 | 00-09 | 0.35 | 0.35 | 00-34
6 | 0.15 | 0.25 | 10-24 | 0.30 | 0.65 | 35-64
7 | 0.20 | 0.45 | 25-44 | 0.20 | 0.85 | 65-84
8 | 0.25 | 0.70 | 45-69 | 0.10 | 0.95 | 85-94
9 | 0.18 | 0.88 | 70-87 | 0.03 | 0.98 | 95-97
10 | 0.12 | 1.00 | 88-99 | 0.02 | 1.00 | 98-99

Unit | Assembly 1 R. No. | Assembly 1 Time | Assembly 2 R. No. | Assembly 2 Time | Total Time
1 | 34 | 7 | 21 | 5 | 12
2 | 43 | 7 | 83 | 7 | 14
3 | 2 | 5 | 36 | 6 | 11
4 | 5 | 5 | 75 | 7 | 12
5 | 28 | 7 | 74 | 7 | 14
6 | 76 | 9 | 11 | 5 | 14
7 | 33 | 7 | 94 | 8 | 15
8 | 45 | 8 | 34 | 5 | 13
9 | 89 | 10 | 19 | 5 | 15
10 | 24 | 6 | 8 | 5 | 11
11 | 43 | 7 | 91 | 8 | 15
12 | 15 | 6 | 44 | 6 | 12
13 | 90 | 10 | 12 | 5 | 15
14 | 80 | 9 | 65 | 7 | 16
15 | 9 | 5 | 54 | 6 | 11
Total | | | | | 200

The expected time is 200 ÷ 15 = 13.33 mins.
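The lookup-and-add procedure of Example 13.5 can be replayed in a few lines. This is a sketch under the interval coding shown in the solution; the helper `to_time` and the variable names are illustrative.

```python
# Replaying Example 13.5 (helper and variable names are illustrative).
def to_time(random_number, upper_bounds):
    """Return the process time whose interval contains the random number."""
    for upper, minutes in upper_bounds:
        if random_number <= upper:
            return minutes
    raise ValueError("random number must lie between 0 and 99")


# (upper bound of the random-number interval, process time in minutes)
assembly1_bounds = [(9, 5), (24, 6), (44, 7), (69, 8), (87, 9), (99, 10)]
assembly2_bounds = [(34, 5), (64, 6), (84, 7), (94, 8), (97, 9), (99, 10)]

rn1 = [34, 43, 2, 5, 28, 76, 33, 45, 89, 24, 43, 15, 90, 80, 9]
rn2 = [21, 83, 36, 75, 74, 11, 94, 34, 19, 8, 91, 44, 12, 65, 54]

totals = [to_time(a, assembly1_bounds) + to_time(b, assembly2_bounds)
          for a, b in zip(rn1, rn2)]
mean_time = sum(totals) / len(totals)
print(totals)
print(round(mean_time, 2))
```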


To repeat, we must be certain that the variables generated above are truly random.

(1) National Center for Biotechnology Information. "Introduction to Monte Carlo
Simulation." https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2924739/ Accessed
March 28, 2020.

INDEX NUMBER

STRUCTURE
14.0 Objectives
14.1 Introduction
14.2 Definitions
14.3 Types of Index Number
14.4 Methods of constructing Index Number
14.5 Unweighted Index Number
14.6 Weighted Index Number
14.7 Test to verify the consistency of Index Number
14.8 Index Number Used in India
14.9 Let Us Sum Up
14.10 Key Words
14.11 Self-Assessment Questions

14.0 OBJECTIVES
After reading this unit, you will be able to:

• explain the concept and purpose of index numbers

• compute index numbers to measure price and quantity changes and
interpret them

• differentiate between weighted and unweighted index numbers

• understand the three principal types of indices

• avoid the problems arising out of incorrect usage of index numbers

14.1 INTRODUCTION
The price of a commodity, the volume of imports and exports, the quantity of agricultural
production, unemployment, etc. change with time. The changes are neither constant nor
do they follow a fixed pattern. Some variables increase with time, e.g. population and prices of
commodities, while others decrease with time, e.g. death rate and purchasing power. It is
important to study these changes to plan for the future. “Indexing” is a technique that
measures changes in a variable or a family of variables with respect to time, location, or other
characteristics. It is one of the most widely used statistical methods, yet a simple and
effective tool. For example, a pharmaceutical firm planning to manufacture a drug
for cervical cancer may want to find out whether the number of cases of cervical cancer
reported this year has increased or decreased, and to what extent, compared to the previous
year; a housewife computing her monthly budget may want to know the change
in the price of LPG and essential items over the past year; a company may be interested in
changes in the prices of raw materials, wages, advertising costs, share prices, profits, etc. For
arriving at decisions, one may be interested to know how much the price of a good has
changed over time.

14.2 DEFINITIONS

The index number is a relative measure used to compare and describe the average change in
the prices, quantities, or values of an item or a group of related items over a period of time.
The ratio is multiplied by 100 and expressed as a percentage. As the index number is
estimated as a ratio of one time period over another, it has no unit.

$$\text{Index number} = \frac{\text{current period value}}{\text{base period value}} \times 100$$

Different authors and institutions across the world have provided different definitions. A few
selected definitions are shared below:
According to Tuttle: “Index number is a single ratio (or a percentage) which measures the
combined change of several variables between two different times, places or situations.”

In the words of Maslow, “An index number is a numerical value characterizing the change in
a complex economic phenomenon over a period of time or space.”
Spiegel defines, “An index number is a statistical measure designed to show changes in a
variable or a group of related variables with respect to time, geographical location or other
characteristics.”
According to Croxton and Cowden, “Index numbers are devices for measuring differences in
the magnitude of a group of related variables.”
Bowley describes index numbers as “a series which reflects in its trend and fluctuations the
movements of some quantity.”
According to Wheldon, “An index number is a device which shows by its variation the changes
in a magnitude which is not capable of accurate measurement in itself or direct valuation in
practice.”

Edgeworth gave the classical definition of index numbers as follows: “Index number shows by
its variations the changes in a magnitude which is not susceptible either of accurate
measurement in itself or of direct valuation in practice.”

In the words of Lawrence J. Kaplan, “An index number is a statistical measure of fluctuations
in a variable arranged in the form of a series, and using a base period for making
comparisons.”

Reading the above definitions, we can see that index numbers are defined
in three ways: as a measure of change, as a device to measure change, or as a series
representing the process of change.

14.3 TYPES OF INDEX NUMBERS

Index numbers are broadly classified into three categories: (i) price indexes, (ii) quantity
indexes, and (iii) value indexes.

(i) Price Index Number

The price index is one of the most prevalent indexes. It is a special type of average
which measures the change in the level of prices from one period to another. In estimating price
indexes, comparisons are made in respect of prices, e.g. the wholesale price index number, the
retail price index number, the consumer price index number, etc.

The price index number is divided into two categories:

• Single Price Index

• Composite Price Index

Single Price Index

A single price index measures changes in price on a percentage scale. It is
calculated as the ratio of the current price per unit of a product to its base period
price, multiplied by 100. These index numbers are constructed for single items only. For ease of
comparison with other years, each actual price is converted to a relative price.
Example 14.1 illustrates the calculation of price relatives.

Example 14.1: Simple Price Index (Base = 2006)

Year (1) | Price of wheat per quintal (2) | Ratio (3) = (2) ÷ 650 | Percentage Relative (4) = (3) × 100
2006 | 650 | 1 | 100
2007 | 750 | 1.1538 | 115.3846
2008 | 1000 | 1.5385 | 153.8462
2009 | 1080 | 1.6615 | 166.1538
2010 | 1120 | 1.7231 | 172.3077
2011 | 1285 | 1.9769 | 197.6923

From Example 14.1, it is observed that, compared to the base year 2006, the price
relative of 115.3846 in 2007 shows an increase of 15.38% in the price of wheat per quintal, the
price relative of 153.8462 in 2008 shows an increase of 53.85%, and so on.
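The percentage relatives in the table are simply each year's price divided by the base price of 650 and multiplied by 100. A minimal sketch, with 2006 taken as the base because its price is the divisor used in the ratio column; the variable names are illustrative.

```python
# Price relatives for the wheat prices of Example 14.1 (names are illustrative).
prices = {2006: 650, 2007: 750, 2008: 1000, 2009: 1080, 2010: 1120, 2011: 1285}
base_year = 2006  # the year whose price (650) is the divisor in the ratio column

relatives = {year: price / prices[base_year] * 100 for year, price in prices.items()}

for year, relative in relatives.items():
    # The change against the base is the relative minus 100.
    print(year, round(relative, 4), f"{relative - 100:+.2f}%")
```

Changing `base_year` rescales every relative, which is why the choice of base period matters when indices are compared.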

Composite Price Index

A price index based on the prices of a selected group of items, commonly known as a
market basket, is called a composite price index. A composite index number is
constructed from the changes in several items. For example, several hundred goods
and services, such as food, beverages, transport, medical care, apparel, and entertainment,
are used in calculating the consumer price index. In India, the Consumer Price
Index is published monthly by the Central Statistical Organization (CSO).

(ii) Quantity Index Number

The quantity index measures the variation in the quantity of goods produced,
purchased, or consumed between two time periods. Here, the comparison is made in
respect of quantity or volume, for example, the volume of wheat produced,
consumed, imported, exported, etc.

Example 14.2: Calculation of quantity index number (base year 2010)

Year (1) | Quantity exported ('000) (2) | Ratio (3) = (2) ÷ 10.75 | Index or percentage relative (4) = (3) × 100
2010 | 10.75 | 1 | 100
2011 | 11.8 | 1.0977 | 109.7674
2012 | 12.69 | 1.1805 | 118.0465
2013 | 13.9 | 1.2930 | 129.3023
2014 | 14.5 | 1.3488 | 134.8837
2015 | 14.9 | 1.3860 | 138.6047

From Example 14.2, it is observed that, compared to the base year 2010, the quantity
relative of 109.7674 in 2011 shows an increase of 9.77%, and so on.

(iii) Value Index Number

Value index numbers are used to compare the total value in a given period
with the total value in the base period, for example, the turnover of a company in
2020 compared to 2008.


Example 14.3: Calculation of value index number (base year 2010)

Year (1) | Turnover value (millions) (2) | Ratio (3) = (2) ÷ 16.89 | Percentage relative (4) = (3) × 100
2010 | 16.89 | 1 | 100
2011 | 15.2 | 0.899940793 | 89.9941
2012 | 21.34 | 1.263469509 | 126.3470
2013 | 23.12 | 1.368857312 | 136.8857
2014 | 23.5 | 1.391355832 | 139.1356
2015 | 25 | 1.480165779 | 148.0166

From the above table, we can say that the turnover in 2015 was about 48% higher than
the turnover in the base year 2010.

Characteristics of Index Number:

• The index number estimates the relative changes in an item or group of items.

• It is a special type of average.

• The index number is very useful when we have to compare different commodities
measured in different units, e.g. the consumer price index number.

Uses of Index Numbers

• Index numbers have many applications in the fields of commerce, economics, etc.

• Index numbers are used mainly to measure fluctuations over intervals of time,
geographical regions, etc.

• Index numbers are used to compare the total variations in the prices of different
commodities whose units of measurement differ.

• As primary data at different costs are adjusted in index numbers, it becomes easy
to transform a nominal wage to a real wage.

• Index numbers can be used as a forecasting technique: they
estimate trends that help in drawing conclusions about cyclical and irregular
components.
Limitations/Precautions while constructing index number

• As indices are estimated from sample data, the results should be interpreted cautiously,
as there are chances of committing errors.

• Proper selection of the base period is very important; the conclusions change
with a change in the base year.

• To calculate index numbers, we choose a basket of items which depicts trends, so
proper selection of commodities is important. If the items are not chosen
correctly, the estimates lead to misleading conclusions.

• There are several methods to estimate the index number. Hence, the selection of
appropriate weights and of an appropriate formula is very important; any
mistake here would lead to a wrong interpretation.

14.4 METHODS OF CONSTRUCTING INDEX NUMBERS


Index numbers of each type (price, quantity, or value) can be further categorized as
unweighted and weighted index numbers, each of which is further subdivided. The
details are shown below:

Methods of constructing index numbers

Unweighted:
(i) Simple aggregative
(ii) Simple average of price relatives

Weighted:
(i) Weighted aggregative
(ii) Weighted average of price relatives

14.5 UNWEIGHTED INDEX NUMBERS


An unweighted price index number measures the percentage change in the price of a single
commodity (item) or a group of commodities (items) between two periods of time. As the
name suggests, no weights are allocated to the variables; in unweighted index
numbers, all the values under study are treated alike, i.e. they have equal
importance. There are two methods in this category.
(i) Simple aggregative method:
Under this method, the aggregate prices of all commodities (items) of the current year are
expressed as a percentage of the same in the base year.

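$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100$$

The corresponding quantity and value indices are obtained in the same way:

$$Q_{01} = \frac{\sum q_1}{\sum q_0} \times 100, \qquad V_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0} \times 100$$

$p_1$ = Current year prices for various commodities

$p_0$ = Base year prices for various commodities

$q_1$, $q_0$ = Current year and base year quantities for various commodities

$P_{01}$ = Price Index number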


Limitations of the simple aggregative method

(i) This method ignores the relative importance of the commodities.

(ii) Highly priced items unduly influence the index number.

Example 14.4:
Construct the Price Index Number for the year 2019, from the following information taking
2018 as the base year.

Commodities | Price in 2018 | Price in 2019
Oil | 100 | 120
Ghee | 450 | 560
Sugar | 32 | 34
Wheat | 25 | 30
Rice | 100 | 130

Solution:
Construction of Price Index:

Commodities | Price in 2018 ($p_0$) | Price in 2019 ($p_1$)
Oil | 100 | 120
Ghee | 450 | 560
Sugar | 32 | 34
Wheat | 25 | 30
Rice | 100 | 130
Total | 707 | 874

$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100 = \frac{874}{707} \times 100 = 123.62$$

The price index for 2019, compared with the base year 2018, has increased by 23.62%.
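The simple aggregative calculation reduces to two sums and one division. A minimal sketch using the prices of Example 14.4, with 2018 as the base year; the function name is illustrative.

```python
# Simple aggregative price index (function name is illustrative).
def simple_aggregative_index(base_prices, current_prices):
    """Sum of current prices as a percentage of the sum of base prices."""
    return sum(current_prices) / sum(base_prices) * 100


p0 = [100, 450, 32, 25, 100]   # 2018 prices: oil, ghee, sugar, wheat, rice
p1 = [120, 560, 34, 30, 130]   # 2019 prices in the same order

index = simple_aggregative_index(p0, p1)
print(round(index, 2))
```

Because the prices are added as they stand, ghee alone contributes well over half of each total, which illustrates the limitation that highly priced items dominate this index.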


Example 14.5
Calculate the Price Index Number for 2019 from the following data by the simple aggregate method,
taking 2018 as the base year.

Commodities | Price per kg in 2018 | Price per kg in 2019
Potato | 15 | 30
Tomato | 50 | 80
Onion | 30 | 70
Garlic | 100 | 120
Ginger | 120 | 160
Lemon | 80 | 100

Solution:

Commodities | Price per kg in 2018 ($p_0$) | Price per kg in 2019 ($p_1$)
Potato | 15 | 30
Tomato | 50 | 80
Onion | 30 | 70
Garlic | 100 | 120
Ginger | 120 | 160
Lemon | 80 | 100
Total | 395 | 560

Price Index:

$$P_{01} = \frac{\sum p_1}{\sum p_0} \times 100 = \frac{560}{395} \times 100 = 141.77$$

The price index for the year 2019, when compared to 2018, has increased by 41.77%.

(ii) Simple average of price relative method


In this method, averages are calculated in two stages. In the first stage, price relatives are
estimated for all items. In the second stage, these price relatives are averaged to get the
index number. The average could be the arithmetic mean, the geometric mean, or even the median.
Let N be the number of items, $p_1$ the price of the commodity in the current year, and $p_0$ the
price of the commodity in the base year. Then,
Average Price Index Number using the arithmetic mean is estimated using the following
formula:

$$P_{01} = \frac{\sum \left( \frac{p_1}{p_0} \times 100 \right)}{N}$$

Average Price Index Number using the geometric mean is estimated using the following
formula:

$$P_{01} = \text{antilog} \left[ \frac{\sum \log \left( \frac{p_1}{p_0} \times 100 \right)}{N} \right]$$
Advantages of Average Price Index

1. As equal importance is given to all the items, the index is not unduly influenced by the extreme
prices of items.
2. The value of the average price relative index is not affected by the units of
measurement of the commodities included in the calculation of index numbers.

Limitations

1. Every item in the index number is given equal weight, but in actual practice this
is not realistic: some price relatives are more important than others.

2. Of the two methods, the arithmetic mean is preferred over the geometric mean to
calculate the average of price relatives.
Example 14.6

Compute the price index number by the simple average of price relatives method using the arithmetic
mean and the geometric mean.

Commodities | Price in 2018 | Price in 2019
Potato | 15 | 30
Tomato | 50 | 80
Onion | 30 | 70
Garlic | 100 | 120
Ginger | 120 | 160
Lemon | 80 | 100

Solution:
Calculation of price index number by a simple average of price relatives:

Commodities | Price in 2018 ($p_0$) | Price in 2019 ($p_1$) | P = $p_1$ ÷ $p_0$ × 100 | log P
Potato | 15 | 30 | 200 | 2.30103
Tomato | 50 | 80 | 160 | 2.20412
Onion | 30 | 70 | 233.33 | 2.367977
Garlic | 100 | 120 | 120 | 2.079181
Ginger | 120 | 160 | 133.33 | 2.124939
Lemon | 80 | 100 | 125 | 2.09691
Total | | | 971.67 | 13.17416

(i) Price relative index number based on the arithmetic mean:

$$P_{01} = \frac{\sum \left( \frac{p_1}{p_0} \times 100 \right)}{N} = \frac{971.67}{6} = 161.94$$

(ii) Price relative index number based on the geometric mean:

$$P_{01} = \text{antilog} \left[ \frac{\sum \log \left( \frac{p_1}{p_0} \times 100 \right)}{N} \right] = \text{antilog} \left( \frac{13.17416}{6} \right) = \text{antilog}(2.1957) = 156.9$$

Hence, the price index numbers based on the arithmetic mean and the geometric mean for
the year 2019 are 161.94 and 156.9 respectively.
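Both averages in Example 14.6 can be computed directly from the price relatives. The sketch below uses base-10 logarithms for the geometric mean, mirroring the log and antilog steps of the solution; the variable names are illustrative.

```python
import math

# Simple average of price relatives for Example 14.6 (names are illustrative).
p0 = [15, 50, 30, 100, 120, 80]     # 2018 prices
p1 = [30, 80, 70, 120, 160, 100]    # 2019 prices

relatives = [current / base * 100 for base, current in zip(p0, p1)]

# Arithmetic mean of the relatives.
am_index = sum(relatives) / len(relatives)

# Geometric mean via logs: antilog of the mean of log10(relative).
mean_log = sum(math.log10(r) for r in relatives) / len(relatives)
gm_index = 10 ** mean_log

print(round(am_index, 2), round(gm_index, 1))
```

The geometric mean is never larger than the arithmetic mean of the same relatives, which is why the second figure is the smaller of the two.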

Examples based on Sections 14.4 - 14.6

14.4 The prices of 5 important agricultural products exported in the years 2000 and 2005
are given below. Using 2000 as the base period, express the 2005 prices in terms of the unweighted
aggregate index.

Product | A | B | C | D | E
2000 | 139 | 300 | 1,200 | 2,000 | 80
2005 | 150 | 530 | 1,350 | 2,400 | 90

14.5 Construct the simple average price relative index number using the arithmetic mean for
the year 2012 for the following data showing the profit from various categories sold in
departmental stores.

Profit (per week) | 2010 | 2012
Stationery items | 15,000 | 12,000
Groceries | 1,55,000 | 1,75,000
Utensils | 18,000 | 16,500
Miscellaneous | 65,000 | 78,000

14.6 Construct the simple average price relative index number using the geometric mean for
the year 2015 for the data showing the expenditure at holiday destinations for a family of
three.

Expenditure per week for a family of three | 2014 | 2015
Kashmir | 84,000 | 96,000
Goa | 80,000 | 92,000
Andaman & Nicobar Islands | 1,08,000 | 12,000
Leh | 1,50,000 | 1,68,000

14.6 WEIGHTED INDEX NUMBERS


In computing weighted index numbers, weights are assigned to different items depending upon
the importance of the commodities. The common method of assigning weights is either
by quantities consumed or by value sold.

Weighted index numbers are also of two types.

(i) Weighted aggregative

(ii) Weighted average of price relatives

1. Weighted aggregate Price Index

In a weighted aggregate price index, each item in the basket chosen for estimating
the index is assigned a weight, usually the corresponding quantity produced, consumed, or
sold in either the base year or the current year, to reflect its importance. This is because
the quantities consumed by the group of customers will be larger for some items and smaller
for others. Hence, a measure of the quantity used is also required for the various
items in the group. Weighting prices by quantities gives a better estimate than
tracking the changes in price alone, and it improves the accuracy of the price level
estimate. It is useful for monitoring changes in price levels over different periods. Inflation
reduces the purchasing power of the individual, hence it is very important to separate real
income from nominal income. As there are various ways of assigning weights, there are
many methods available for constructing index numbers. A few important approaches to
determining weights are described below:
a) Laspeyres’ Index
b) Paasche’s Index
c) Dorbish and Bowley’s Index
d) Fisher’s Ideal Index
e) Marshall-Edgeworth Index
f) Kelly’s Index
g) Walsh’s Index
a) Laspeyres’ method
This method was developed by the German economist Etienne Laspeyres in 1871. He proposed
using base year quantities as weights. Hence, it is also called the base year quantity
weighted method. The formula for estimating Laspeyres’ Price Index is:

$$\text{Laspeyres' Price Index} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$$

$p_1$ = price in the current period

$p_0$ = price in the base period

$q_0$ = quantity consumed in the base period

In the Laspeyres’ Index, the basic assumption is that the individual buys the same
basket of goods in the current period as he did in the base period. This helps to answer the
question of how much income has to be increased to compensate for inflation.

Advantage:
The main advantage of this method is that, as the index uses the same base prices and
quantities, it becomes easier to compare the index of one period with another.

Disadvantage:
The consumption of commodities rises and falls with fluctuations in prices, so
keeping the quantity fixed may not be realistic. The method assumes that the consumer is consuming
the same basket of goods as before. However, it has been found that when the prices of
some goods increase, the quantity consumed decreases and consumption shifts to goods with
lower prices.
The following curve generally holds good.

Fig. 14.1: Demand Curve (vertical axis: price in $, 0 to 100; horizontal axis: quantity in units, 0 to 100)

Example 14.7: Compute the cost of living index using Laspeyres’ method from the
following information.

Commodities | Quantity consumed in 2018 | Price in 2018 | Price in 2019
Oil | 150 | 100 | 120
Ghee | 80 | 450 | 560
Sugar | 100 | 32 | 34
Wheat | 200 | 25 | 30
Rice | 300 | 100 | 130

Solution:

Commodities | Quantity consumed in 2018 ($q_0$) | Price in 2018 (Rs) ($p_0$) | Price in 2019 (Rs) ($p_1$) | $p_1 q_0$ | $p_0 q_0$
Oil | 150 | 100 | 120 | 18,000 | 15,000
Ghee | 80 | 450 | 560 | 44,800 | 36,000
Sugar | 100 | 32 | 34 | 3,400 | 3,200
Wheat | 200 | 25 | 30 | 6,000 | 5,000
Rice | 300 | 100 | 130 | 39,000 | 30,000
Total | 830 | 707 | 874 | 1,11,200 | 89,200

$$\text{Laspeyres' Price Index} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{1,11,200}{89,200} \times 100 = 124.6637$$

b) Paasche’s Weighting Method

This method was developed by the German economist Hermann Paasche in 1874. In contrast
to Laspeyres’ method, Paasche’s method takes the current year quantities as
weights. As the weights have to be revised continuously, this method is not used when
the number of commodities is large. Paasche’s Price Index is commonly referred to as
the current weighted index and is given as:

$$\text{Paasche's Price Index} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$$

$p_1$ = price in the current period

$p_0$ = price in the base period

$q_1$ = quantity consumed in the current period

A Paasche index of 100 means that the consumer could afford in the
base period the same basket of goods that they can afford in the current period. This information
helps to answer the question of how much utility can be taken away from an individual at base
prices to have the same impact on their utility as at current prices.

Advantages
The main advantage of Paasche’s method is that it reflects changes in both price and quantity;
hence it provides a better estimate of changes in the index than the Laspeyres
method. If the quantity consumed in the base year is the same as the quantity consumed in
the current year, then Laspeyres’ and Paasche’s index
numbers give the same answer.
Disadvantage

As this method uses the quantities consumed in the current year, obtaining the
data is time-consuming and expensive. Moreover, unlike the Laspeyres index
number, it is difficult to compare indexes of different periods, since the index for
each year requires the weights to be re-computed.
Example 14.8: For the following data, calculate the price index number of 2019 with 2018 as
base year using Laspeyres’ method and Paasche’s method.

Items | Quantity consumed in 2018 | Quantity consumed in 2019 | Price in 2018 | Price in 2019
A | 20 | 30 | 10 | 15
B | 30 | 20 | 12 | 10
C | 40 | 60 | 14 | 20
D | 50 | 50 | 20 | 25
E | 60 | 40 | 25 | 22
Total | 200 | 200 | 81 | 92

Solution:

Laspeyres’ Method

Items | $q_0$ | $q_1$ | $p_0$ | $p_1$ | $p_1 q_0$ | $p_0 q_0$
A | 20 | 30 | 10 | 15 | 300 | 200
B | 30 | 20 | 12 | 10 | 300 | 360
C | 40 | 60 | 14 | 20 | 800 | 560
D | 50 | 50 | 20 | 25 | 1,250 | 1,000
E | 60 | 40 | 25 | 22 | 1,320 | 1,500
Total | 200 | 200 | 81 | 92 | 3,970 | 3,620

$$\text{Laspeyres' Price Index} = \frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100 = \frac{3,970}{3,620} \times 100 = 109.6685$$

Paasche’s Method

Items | $q_0$ | $q_1$ | $p_0$ | $p_1$ | $p_1 q_1$ | $p_0 q_1$
A | 20 | 30 | 10 | 15 | 450 | 300
B | 30 | 20 | 12 | 10 | 200 | 240
C | 40 | 60 | 14 | 20 | 1,200 | 840
D | 50 | 50 | 20 | 25 | 1,250 | 1,000
E | 60 | 40 | 25 | 22 | 880 | 1,000
Total | 200 | 200 | 81 | 92 | 3,980 | 3,380

$$\text{Paasche's Price Index} = \frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100 = \frac{3,980}{3,380} \times 100 = 117.7515$$

Laspeyres' price index shows a price level increase of 9.67%, whereas Paasche's price index
shows a price level increase of 17.75%.
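The two indices differ only in which year's quantities serve as weights, which a short script makes explicit. A minimal sketch using the data of Example 14.8; the function names are illustrative.

```python
# Laspeyres and Paasche price indices (function names are illustrative).
def weighted_sum(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))


def laspeyres(p0, p1, q0):
    """Base-year quantities q0 are the weights."""
    return weighted_sum(p1, q0) / weighted_sum(p0, q0) * 100


def paasche(p0, p1, q1):
    """Current-year quantities q1 are the weights."""
    return weighted_sum(p1, q1) / weighted_sum(p0, q1) * 100


# Data of Example 14.8, items A to E.
q0 = [20, 30, 40, 50, 60]
q1 = [30, 20, 60, 50, 40]
p0 = [10, 12, 14, 20, 25]
p1 = [15, 10, 20, 25, 22]

print(round(laspeyres(p0, p1, q0), 4))
print(round(paasche(p0, p1, q1), 4))
```

Passing the same quantity list to both functions gives identical results, in line with the remark that the two indices coincide when consumption does not change.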
c) Dorbish and Bowley's method

Laspeyres' method is based on the impact of quantities of the base year, and Paasche's
method is based on the impact of quantities of the current year. To capture the influence of
both periods, i.e. the base period and the current period, Dorbish and Bowley in 1901 suggested
taking the average of both indexes. The formula for this index is given as

$$\text{Dorbish and Bowley's Price Index} = \frac{1}{2} \left[ \frac{\sum p_1 q_0}{\sum p_0 q_0} + \frac{\sum p_1 q_1}{\sum p_0 q_1} \right] \times 100$$

It is the arithmetic mean of the two indexes.

d) Fisher’s Ideal Index

In 1920, Irving Fisher proposed to estimate the index number using the geometric mean of
Laspeyres’ Index and Paasche’s Index. It is called Fisher’s Ideal Index Number, and the
formula is given by:

$$\text{Fisher's Ideal Price Index} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$$

Advantages
• The geometric mean is the best tool for constructing index numbers.

• As the quantities of both the current and base periods are used as weights, it
avoids the biases associated with both Laspeyres’ and Paasche’s index numbers.
• This method also satisfies the two important tests required of an index, i.e. the
time-reversal test and the factor-reversal test.
Disadvantage
Although the index is theoretically better, it is not used often, as this method needs a lot of
computation time.
Example 14.9: For the following data, calculate the price index number of 2019 with 2018 as
base year using Fisher’s ideal method.

Items | Quantity consumed in 2018 | Quantity consumed in 2019 | Price in 2018 | Price in 2019
A | 20 | 30 | 10 | 15
B | 30 | 20 | 12 | 10
C | 40 | 60 | 14 | 20
D | 50 | 50 | 20 | 25
E | 60 | 40 | 25 | 22
Total | 200 | 200 | 81 | 92

Solution:

Items | $q_0$ | $q_1$ | $p_0$ | $p_1$ | $p_1 q_0$ | $p_0 q_0$ | $p_1 q_1$ | $p_0 q_1$
A | 20 | 30 | 10 | 15 | 300 | 200 | 450 | 300
B | 30 | 20 | 12 | 10 | 300 | 360 | 200 | 240
C | 40 | 60 | 14 | 20 | 800 | 560 | 1,200 | 840
D | 50 | 50 | 20 | 25 | 1,250 | 1,000 | 1,250 | 1,000
E | 60 | 40 | 25 | 22 | 1,320 | 1,500 | 880 | 1,000
Total | 200 | 200 | 81 | 92 | 3,970 | 3,620 | 3,980 | 3,380

$$\text{Fisher's Ideal Price Index} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100 = \sqrt{\frac{3,970}{3,620} \times \frac{3,980}{3,380}} \times 100 = 113.638$$

Hence, we conclude that in the year 2019 the price index has increased by 13.64%.
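Fisher's index needs only the four column totals of the table above. The sketch below computes it from those totals and also checks the time-reversal property numerically, by interchanging the roles of the base and current years; the variable names are illustrative.

```python
import math

# Column totals from the solution table of Example 14.9 (names are illustrative).
sum_p1q0, sum_p0q0 = 3970, 3620
sum_p1q1, sum_p0q1 = 3980, 3380

laspeyres_ratio = sum_p1q0 / sum_p0q0
paasche_ratio = sum_p1q1 / sum_p0q1
fisher = math.sqrt(laspeyres_ratio * paasche_ratio) * 100
print(round(fisher, 3))

# Time reversal: swap base and current years and recompute the index as a ratio.
fisher_reversed = math.sqrt((sum_p0q1 / sum_p1q1) * (sum_p0q0 / sum_p1q0))
# The forward and reversed ratios should multiply to 1.
print(round((fisher / 100) * fisher_reversed, 6))
```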
Example 14.10: For the following data, calculate the price index number of 2019 with 2018
as base year using Fisher’s ideal method.

Items | Expenditure on quantity consumed in 2018 | Expenditure on quantity consumed in 2019 | Price in 2018 (Rs) | Price in 2019 (Rs)
A | 200 | 300 | 10 | 15
B | 300 | 200 | 12 | 10
C | 420 | 600 | 15 | 20
D | 500 | 500 | 20 | 25
E | 600 | 400 | 25 | 20

Solution:
There is a difference between the data of the previous example and this one. Here, instead of
the quantities used, the data given are on expenditure. So we need to calculate the quantity from
the expenditure and the price: Quantity consumed = expenditure ÷ price.

Items | Expenditure in 2018 | Expenditure in 2019 | $q_0$ | $q_1$ | $p_0$ | $p_1$ | $p_1 q_0$ | $p_0 q_0$ | $p_1 q_1$ | $p_0 q_1$
A | 200 | 300 | 20 | 20 | 10 | 15 | 300 | 200 | 300 | 200
B | 300 | 200 | 25 | 20 | 12 | 10 | 250 | 300 | 200 | 240
C | 420 | 600 | 28 | 30 | 15 | 20 | 560 | 420 | 600 | 450
D | 500 | 500 | 25 | 20 | 20 | 25 | 625 | 500 | 500 | 400
E | 600 | 400 | 24 | 20 | 25 | 20 | 480 | 600 | 400 | 500
Total | 2,020 | 2,000 | 122 | 110 | 82 | 90 | 2,215 | 2,020 | 2,000 | 1,790

$$\text{Fisher's Ideal Price Index} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100 = \sqrt{\frac{2,215}{2,020} \times \frac{2,000}{1,790}} \times 100 = 110.688$$
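The extra step in this example, recovering quantities from expenditure, can be sketched as follows. The data are those of Example 14.10; the variable and function names are illustrative.

```python
import math

# Example 14.10: quantities are recovered as expenditure / price (names are illustrative).
expenditure_2018 = [200, 300, 420, 500, 600]
expenditure_2019 = [300, 200, 600, 500, 400]
p0 = [10, 12, 15, 20, 25]   # prices in 2018
p1 = [15, 10, 20, 25, 20]   # prices in 2019

q0 = [e / p for e, p in zip(expenditure_2018, p0)]   # base-year quantities
q1 = [e / p for e, p in zip(expenditure_2019, p1)]   # current-year quantities


def weighted_sum(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))


fisher = math.sqrt(
    (weighted_sum(p1, q0) / weighted_sum(p0, q0))
    * (weighted_sum(p1, q1) / weighted_sum(p0, q1))
) * 100
print(q0, q1)
print(round(fisher, 3))
```

Note that the sum of p0 × q0 is just the total 2018 expenditure and the sum of p1 × q1 is the total 2019 expenditure, so only the two cross terms actually depend on the derived quantities.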


e) Marshall-Edgeworth method

Marshall and Edgeworth proposed using the arithmetic mean of the base-year and current-year quantities as weights. The formula for computing the index is

Marshall-Edgeworth Price Index = $\frac{\sum p_1 (q_0 + q_1)}{\sum p_0 (q_0 + q_1)} \times 100$

which, on simplifying, gives

Marshall-Edgeworth Price Index = $\frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100$

This takes the average of the quantities consumed (the division by 2 cancels out). It thus takes into account both the quantities altered by the change in prices and the original quantities: the consumer may have wished to consume as much in the current year as in the previous year but was unable to do so because of price inflation. The method gives due weight to this aspect.
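The remark that the division by 2 cancels out can be written out as a short derivation. This is a sketch in the notation used above, with the average quantity $(q_0 + q_1)/2$ taken as the weight:

```latex
\[
\frac{\sum p_1 \cdot \frac{q_0 + q_1}{2}}{\sum p_0 \cdot \frac{q_0 + q_1}{2}} \times 100
= \frac{\tfrac{1}{2}\left(\sum p_1 q_0 + \sum p_1 q_1\right)}{\tfrac{1}{2}\left(\sum p_0 q_0 + \sum p_0 q_1\right)} \times 100
= \frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100
\]
```

The factor of one half appears in both numerator and denominator, so weighting by the sum of the quantities and weighting by their average give the same index.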

f) Kelly's method

This method was proposed by Truman L. Kelly, who suggested a fixed-weight approach to estimating the index number.

Kelly's Price Index = $\frac{\sum p_1 q}{\sum p_0 q} \times 100$

Here q is a quantity that need not refer to the base year or the current year. If q is the average quantity of two years, then q = (q0 + q1) ÷ 2. Similarly, the average of three or more years can also be used as weights. The logic is the same as for Marshall-Edgeworth.

Advantages:
The main advantage of this method is that it does not require the information regarding the
yearly changes in the weights. A basket for various years’ consumption can be a better
indicator. This can improve the accuracy of the index number. The only point to remember is
that the weight should be appropriate and indicate the relative importance of various
commodities.

Disadvantage
The index does not take either the base year or current year as a fixed weight.

g) Walsh's Method

Correa Walsh proposed this formula in 1901. He proposed using the geometric mean of the base-year and current-year quantities as the weight. The formula is given below.

Walsh's Price Index = $\frac{\sum p_1 \sqrt{q_0 q_1}}{\sum p_0 \sqrt{q_0 q_1}} \times 100$

It satisfies the time-reversal test.


Example 14.11: Construct weighted aggregate index numbers of price from the following data by applying
1. Laspeyres' method
2. Paasche's method
3. Dorbish and Bowley's method
4. Fisher's ideal method
5. Marshall-Edgeworth method
6. Kelly's method
7. Walsh's method

Items | Quantity consumed in 2018 | Quantity consumed in 2019 | Price in 2018 | Price in 2019
A | 20 | 30 | 10 | 15
B | 30 | 20 | 12 | 10
C | 40 | 60 | 14 | 20
Total | 90 | 110 | 36 | 45

Solution:

Items | q0 | q1 | p0 | p1 | p1q0 | p0q0 | p1q1 | p0q1
A | 20 | 30 | 10 | 15 | 300 | 200 | 450 | 300
B | 30 | 20 | 12 | 10 | 300 | 360 | 200 | 240
C | 40 | 60 | 14 | 20 | 800 | 560 | 1,200 | 840
Total | 90 | 110 | 36 | 45 | 1,400 | 1,120 | 1,850 | 1,380

(1) Laspeyres' Index

Laspeyres' Price Index = $\frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$

= (1,400 ÷ 1,120) × 100 = 125%

(2) Paasche's Index

Paasche's Price Index = $\frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$

= (1,850 ÷ 1,380) × 100 = 134.06%

(3) Dorbish and Bowley's Index

Dorbish and Bowley's Price Index = $\frac{1}{2}\left[\frac{\sum p_1 q_0}{\sum p_0 q_0} + \frac{\sum p_1 q_1}{\sum p_0 q_1}\right] \times 100$

= [(1.25 + 1.34) ÷ 2] × 100 = 129.5%

(4) Fisher's Ideal Index

Fisher's Ideal Price Index = $\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$

= [SQRT (1.25 × 1.34)] × 100 = 129.42%

(5) Marshall-Edgeworth Method

Marshall-Edgeworth Price Index = $\frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100$

= [(1,400 + 1,850) ÷ (1,120 + 1,380)] × 100

= (3,250 ÷ 2,500) × 100 = 130%

To calculate Kelly's and Walsh's price indices we need the average quantity q and the geometric mean Q of q0 and q1:

q = (q0 + q1) ÷ 2
Q = sqrt(q0 × q1)

Items | q0 | q1 | p0 | p1 | q | p1q | p0q | Q | p0Q | p1Q
A | 20 | 30 | 10 | 15 | 25 | 375 | 250 | 24.49 | 244.95 | 367.42
B | 30 | 20 | 12 | 10 | 25 | 250 | 300 | 24.49 | 293.94 | 244.95
C | 40 | 60 | 14 | 20 | 50 | 1,000 | 700 | 48.99 | 685.86 | 979.80
Total | 90 | 110 | 36 | 45 | 100 | 1,625 | 1,250 | 97.98 | 1,224.74 | 1,592.17

(6) Kelly's Method

Kelly's Price Index = $\frac{\sum p_1 q}{\sum p_0 q} \times 100$

= (1,625 ÷ 1,250) × 100 = 130%

(7) Walsh's Method

Walsh's Price Index = $\frac{\sum p_1 \sqrt{q_0 q_1}}{\sum p_0 \sqrt{q_0 q_1}} \times 100$

= (1,592.17 ÷ 1,224.74) × 100 = 130%
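All seven weighted aggregate indices differ only in the weights applied to the prices, which the following Python sketch makes explicit. The helper names `agg` and `weighted_index` are illustrative, not from the text; the data are those of Example 14.11.

```python
from math import sqrt

# Data from Example 14.11 (base year 2018, current year 2019).
q0 = [20, 30, 40]
q1 = [30, 20, 60]
p0 = [10, 12, 14]
p1 = [15, 10, 20]

def agg(prices, weights):
    """Sum of price x weight over all items."""
    return sum(p * w for p, w in zip(prices, weights))

def weighted_index(weights):
    """Weighted aggregate price index (in %) for the given weights."""
    return agg(p1, weights) / agg(p0, weights) * 100

laspeyres = weighted_index(q0)                       # base-year quantities
paasche = weighted_index(q1)                         # current-year quantities
dorbish_bowley = (laspeyres + paasche) / 2           # arithmetic mean of the two
fisher = sqrt(laspeyres * paasche)                   # geometric mean of the two
marshall_edgeworth = weighted_index([a + b for a, b in zip(q0, q1)])
kelly = weighted_index([(a + b) / 2 for a, b in zip(q0, q1)])
walsh = weighted_index([sqrt(a * b) for a, b in zip(q0, q1)])

for name, value in [
    ("Laspeyres", laspeyres), ("Paasche", paasche),
    ("Dorbish-Bowley", dorbish_bowley), ("Fisher", fisher),
    ("Marshall-Edgeworth", marshall_edgeworth),
    ("Kelly", kelly), ("Walsh", walsh),
]:
    print(f"{name:20s}{value:8.2f}")
```

Because the sketch carries the Paasche ratio at full precision instead of rounding it to 1.34, its Dorbish-Bowley and Fisher values (about 129.53 and 129.45) differ slightly in the second decimal place from the hand calculation.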

Example 14.12 Calculate the suitable price index for the following data.

Commodity | Quantity | Price in 2015 | Price in 2016
A | 20 | 3 | 7
B | 25 | 5 | 8
C | 35 | 6 | 9

Solution: Kelly's price index number is the most suitable index number, as the quantities are the same for both the current year and the base year.

Commodity | q | p0 | p1 | p0q | p1q
A | 20 | 3 | 7 | 60 | 140
B | 25 | 5 | 8 | 125 | 200
C | 35 | 6 | 9 | 210 | 315
Total | | | | 395 | 655

Kelly's price index number:

Kelly's Price Index = $\frac{\sum p_1 q}{\sum p_0 q} \times 100$

= (655 ÷ 395) × 100
= 165.82


Examples based on Section 14,7

14.7 Calculate the price indices from the following data by applying

(1) Laspeyres' method,
(2) Paasche's method, and
(3) Fisher's ideal method, taking 2010 as the base year.

Commodity | 2010 Prices | 2010 Quantities | 2011 Prices | 2011 Quantities
Oil | 20 | 10 | 25 | 13
Pulses | 50 | 8 | 60 | 7
Sugar | 35 | 7 | 40 | 6
Wheat | 25 | 5 | 35 | 4

14.8 Calculate the Dorbish and Bowley's price index number for the following data, taking 2014 as the base year.

Items | 2014 Prices | 2014 Quantities | 2015 Prices | 2015 Quantities
A | 50 | 30 | 54 | 35
B | 35 | 2 | 45 | 3
C | 80 | 3 | 100 | 4
D | 25 | 2 | 30 | 3
E | 35 | 2 | 40 | 3

Example 14.9: Compute the Marshall-Edgeworth price index number for the following data, taking 2014 as the base year.

Items purchased from Shoppers Stop | 2014 Prices | 2014 Quantities | 2015 Prices | 2015 Quantities
Perfumes | 700 | 150 | 900 | 175
Footwear | 1,000 | 100 | 1,200 | 150
Imitation | 500 | 70 | 600 | 100
Watches | 1,500 | 50 | 1,800 | 60
Bags | 400 | 100 | 600 | 150
Cosmetics | 1,200 | 300 | 1,500 | 250

Example 14.10: Calculate the suitable price index for the following data:

Commodity | Shops | Price in 2015 | Price in 2016
A | 30 | 4 | 6
B | 15 | 3 | 7
C | 20 | 6 | 8

2. WEIGHTED AVERAGE OF PRICE RELATIVES

The weighted average of price relatives is computed by introducing weights into the unweighted price relatives. The weight of each commodity is the value consumed in the base period, so that the weights reflect the consumption levels of the average consumer. As shown previously, we may use either the arithmetic mean or the geometric mean to average the weighted price relatives.

The weighted average of price relatives using the arithmetic mean:
If the price relative is P = (p1 ÷ p0) × 100 and the weight is w = p0q0, then the weighted price relative index is:

$P_{01} = \frac{\sum \left(\frac{p_1}{p_0} \times 100\right) p_0 q_0}{\sum p_0 q_0} = \frac{\sum wP}{\sum w}$

The weighted average of price relatives using the geometric mean:

$P_{01} = \text{antilog}\left[\frac{\sum w \log P}{\sum w}\right]$
Example 14.13

Compute the price index for the following data by applying the weighted average of price relatives method using (i) the arithmetic mean and (ii) the geometric mean.

Item | p0 | Quantity consumed (kg) | p1
X | 30 | 20 | 35
Y | 20 | 40 | 22
Z | 10 | 10 | 11

Solution:

Computation for the weighted average of price relatives:

Items | p0 | Quantity consumed (kg), q0 | p1 | w = p0 × q0 | P = (p1 ÷ p0) × 100 | log P | wP | w log P
X | 30 | 20 | 35 | 600 | 116.67 | 2.07 | 70,000 | 1,240.17
Y | 20 | 40 | 22 | 800 | 110 | 2.04 | 88,000 | 1,633.11
Z | 10 | 10 | 11 | 100 | 110 | 2.04 | 11,000 | 204.14
Total | | | | 1,500 | | | 1,69,000 | 3,077.42

The index number using the arithmetic mean of price relatives is:

$P_{01} = \frac{\sum wP}{\sum w}$ = 1,69,000 ÷ 1,500 = 112.67

This means that there has been a 12.67% increase in prices over the base year.
The index number using the geometric mean of price relatives is:

$P_{01} = \text{antilog}\left[\frac{\sum w \log P}{\sum w}\right]$ = antilog (3,077.42 ÷ 1,500)

= antilog (2.0516)
= 112.62
This means that there has been a 12.62% increase in prices over the base year.
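The two averages can be reproduced with a few lines of Python. This is a minimal sketch using the data of Example 14.13; the variable names are illustrative, and the antilog is taken as a power of 10 because the logarithms in the table are to base 10.

```python
from math import log10

# Data from Example 14.13.
p0 = [30, 20, 10]   # base-year prices
q0 = [20, 40, 10]   # base-year quantities (kg)
p1 = [35, 22, 11]   # current-year prices

weights = [p * q for p, q in zip(p0, q0)]           # w = p0 * q0
relatives = [b / a * 100 for a, b in zip(p0, p1)]   # P = (p1 / p0) * 100
total_w = sum(weights)

# (i) weighted arithmetic mean of the price relatives
arithmetic = sum(w * r for w, r in zip(weights, relatives)) / total_w

# (ii) weighted geometric mean: antilog of the weighted mean of log P
mean_log = sum(w * log10(r) for w, r in zip(weights, relatives)) / total_w
geometric = 10 ** mean_log

print(round(arithmetic, 2), round(geometric, 2))   # 112.67 112.62
```

The geometric mean comes out slightly below the arithmetic mean, as it does for any set of relatives that are not all equal.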

3. QUANTITY INDEX NUMBER
Just like the price index number, the quantity index number measures the changes in the level of quantities of items consumed, produced, or distributed during a year under investigation compared to another year known as the base year.

Laspeyres' Quantity Index = $\frac{\sum q_1 p_0}{\sum q_0 p_0} \times 100$

Paasche's Quantity Index = $\frac{\sum q_1 p_1}{\sum q_0 p_1} \times 100$

Fisher's Ideal Quantity Index = $\sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} \times 100$

As with the price index, these formulae measure the quantity index, with the quantities of the different commodities weighted by their prices.

Example 14.14
Compute the following quantity indices from the data given below:

(i) Laspeyres' quantity index
(ii) Paasche's quantity index and
(iii) Fisher's quantity index

Item | 2010 Price | 2010 Total value | 2015 Price | 2015 Total value
A | 10 | 100 | 15 | 225
B | 15 | 105 | 18 | 378
C | 8 | 152 | 11 | 242

Solution:

Since we are given the values and the prices, the quantity figures can be obtained by dividing the value by the price for each of the commodities.

Items | p0 | q0 | p1 | q1 | p0q0 | p1q0 | p0q1 | p1q1
A | 10 | 10 | 15 | 15 | 100 | 150 | 150 | 225
B | 15 | 7 | 18 | 21 | 105 | 126 | 315 | 378
C | 8 | 19 | 11 | 22 | 152 | 209 | 176 | 242
Total | | | | | 357 | 485 | 641 | 845

(i) Laspeyres' quantity index

Laspeyres' Quantity Index = $\frac{\sum q_1 p_0}{\sum q_0 p_0} \times 100$

= (641 ÷ 357) × 100
= 179.55

(ii) Paasche's quantity index

Paasche's Quantity Index = $\frac{\sum q_1 p_1}{\sum q_0 p_1} \times 100$

= (845 ÷ 485) × 100

= 174.23

(iii) Fisher's quantity index

Fisher's Ideal Quantity Index = $\sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}} \times 100$

= SQRT (179.55 × 174.23)

= 176.87
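A short Python sketch of the same calculation, starting from prices and total values as in Example 14.14. The names `agg`, `v0` and `v1` are illustrative and not from the text.

```python
from math import sqrt

# Example 14.14 data: price and total value (price x quantity) per item.
p0, v0 = [10, 15, 8], [100, 105, 152]     # 2010 (base year)
p1, v1 = [15, 18, 11], [225, 378, 242]    # 2015 (current year)

# Recover quantities: quantity = value / price.
q0 = [v / p for v, p in zip(v0, p0)]
q1 = [v / p for v, p in zip(v1, p1)]

def agg(quantities, prices):
    """Sum of quantity x price over all items."""
    return sum(q * p for q, p in zip(quantities, prices))

laspeyres_q = agg(q1, p0) / agg(q0, p0) * 100   # 641 / 357, base-year prices as weights
paasche_q = agg(q1, p1) / agg(q0, p1) * 100     # 845 / 485, current-year prices as weights
fisher_q = sqrt(laspeyres_q * paasche_q)

print(round(laspeyres_q, 2), round(paasche_q, 2), round(fisher_q, 2))
```

Note that the roles of p and q are simply swapped relative to the price index formulas: quantities are compared, and prices act as the weights.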

14.7 TESTS TO VERIFY THE CONSISTENCY OR ADEQUACY OF AN INDEX NUMBER

Many researchers have suggested different formulas to verify the consistency or adequacy of an index number. Some of the most used tests are given below:

• Order reversal test
• Time reversal test
• Factor reversal test
• Circular test
• Unit test

No single index number formula satisfies all the tests mentioned above. An ideal formula is the one that satisfies the maximum possible number of relevant tests under study.

Order Reversal Test

According to the order reversal test, the value of the index number should not change even if the arrangement of the items is reversed. All twelve methods of index number construction satisfy the order reversal test.

Time Reversal Test

The time-reversal test was proposed by Irving Fisher. According to Fisher, "Time-reversal test is the test which gives the same ratio between one point of comparison with other for the calculation of index number irrespective of the fact which of the two is taken as a base". This test maintains time consistency by working in both directions, i.e. forward and backward in time. In simple words, the test is satisfied if the product of the two index numbers is unity when the base year and the current year are interchanged, i.e. P01 × P10 = 1 (with both indices expressed as ratios, not percentages).

The time-reversal test is satisfied by the simple aggregative method, Fisher's method, the Marshall-Edgeworth method and Kelly's method.

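For Fisher's ideal index,

$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$

When the base year and current year are interchanged we get

$P_{10} = \sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}}$

$P_{01} \times P_{10} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1} \times \frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}}$

$P_{01} \times P_{10} = 1$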

Factor Reversal Test

This test was also suggested by Prof. Irving Fisher. In Fisher's words, "Just as each formula should permit the interchange of two time periods without inconsistent results, similarly it ought to permit interchanging the prices and quantities without giving inconsistent results i.e., the two results multiplied together should give the true ratio."

In other words, an index number formula should give consistent results even if the price and quantity factors are interchanged, i.e. Price Index × Quantity Index = Value Index.

Except for Fisher's ideal index number, none of the formulas discussed above satisfies this test.

$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$

$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$

$P_{01} \times Q_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1} \times \frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$

$P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$

Circular Test

It is an extension of the time-reversal test. The time-reversal test takes into account only two years, i.e. the current and base years, whereas the circular test requires that an index number formula work circularly over three or more years. An index number is said to satisfy the circular test when the three indices P01, P12 and P20 are such that P01 × P12 × P20 = 1.

The circular test is not satisfied by weighted aggregative methods whose weights change from period to period: Laspeyres', Paasche's, Fisher's ideal index, Marshall and Edgeworth's, Dorbish and Bowley's, etc. do not satisfy this test.
However, the following methods do satisfy the test:

• Simple aggregative method
• Weighted aggregative method with fixed weights
• Kelly's method
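The circular test can be checked numerically. The sketch below uses hypothetical prices and quantities for three periods (the figures and the fixed weight vector are invented for illustration and do not come from the text) and multiplies the three indices around the cycle 0 → 1 → 2 → 0.

```python
# Hypothetical prices and quantities for periods 0, 1 and 2 (not from the text).
p = {0: [10, 12, 14], 1: [15, 10, 20], 2: [18, 11, 21]}
q = {0: [20, 30, 40], 1: [30, 20, 60], 2: [25, 28, 50]}
FIXED_WEIGHTS = [25, 26, 50]   # assumed fixed basket, the same for every comparison

def agg(prices, weights):
    """Sum of price x weight over all items."""
    return sum(a * b for a, b in zip(prices, weights))

def simple_aggregative(a, b):
    """Index of period b on base a, as a ratio."""
    return sum(p[b]) / sum(p[a])

def fixed_weight(a, b):
    """Fixed-weight (Kelly-type) aggregative index of period b on base a."""
    return agg(p[b], FIXED_WEIGHTS) / agg(p[a], FIXED_WEIGHTS)

def laspeyres(a, b):
    """Laspeyres index of period b on base a; weights change with the base."""
    return agg(p[b], q[a]) / agg(p[a], q[a])

def circular_product(index):
    """P01 x P12 x P20 for the given index function."""
    return index(0, 1) * index(1, 2) * index(2, 0)

for fn in (simple_aggregative, fixed_weight, laspeyres):
    print(f"{fn.__name__:20s}{circular_product(fn):.4f}")
```

With fixed weights every numerator cancels against the next denominator, so the product is exactly 1; with Laspeyres' formula the weights change with the base period and the product drifts away from 1.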
Unit Test

An index number formula satisfies the unit test if the value of the index is not affected when the units in which prices are quoted are altered, e.g. when prices per kg are converted to prices per quintal or vice versa. All the index number formulas satisfy this test except the simple aggregative method, in which a change in the unit of price of any item can change the index number drastically.
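To see why the simple aggregative method fails the unit test, consider the following sketch. The two-item data set is hypothetical (invented for illustration); the first item is re-quoted from price per kg to price per quintal (1 quintal = 100 kg), which multiplies its prices by 100 and divides its quantities by 100.

```python
# Hypothetical data (not from the text): item 0 priced per kg, item 1 per litre.
p0 = [20.0, 50.0]    # base-year prices
p1 = [30.0, 55.0]    # current-year prices
q0 = [100.0, 10.0]   # base-year quantities

def simple_aggregative(base_prices, current_prices):
    return sum(current_prices) / sum(base_prices) * 100

def laspeyres(base_prices, current_prices, base_quantities):
    num = sum(p * q for p, q in zip(current_prices, base_quantities))
    den = sum(p * q for p, q in zip(base_prices, base_quantities))
    return num / den * 100

# Re-quote item 0 per quintal: price x 100, quantity / 100.
p0_quintal = [p0[0] * 100, p0[1]]
p1_quintal = [p1[0] * 100, p1[1]]
q0_quintal = [q0[0] / 100, q0[1]]

print(simple_aggregative(p0, p1), simple_aggregative(p0_quintal, p1_quintal))
print(laspeyres(p0, p1, q0), laspeyres(p0_quintal, p1_quintal, q0_quintal))
```

The simple aggregative index moves from about 121 to about 149 although nothing real has changed, while the weighted index is unaffected because each price is multiplied by a quantity in the matching unit.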
Example 14.15

The table below provides the prices of the base year and current year of 5 commodities with their quantities. Use it to verify whether Fisher's ideal index satisfies the time-reversal test.

Item | Base Year Price | Base Year Quantity | Current Year Price | Current Year Quantity
A | 10 | 10 | 15 | 15
B | 15 | 7 | 18 | 21
C | 12 | 8 | 15 | 12
D | 18 | 12 | 22 | 15
E | 8 | 19 | 11 | 22

Solution:

Index number by Fisher's ideal index method

Items | p0 | q0 | p1 | q1 | p0q0 | p1q0 | p0q1 | p1q1
A | 10 | 10 | 15 | 15 | 100 | 150 | 150 | 225
B | 15 | 7 | 18 | 21 | 105 | 126 | 315 | 378
C | 12 | 8 | 15 | 12 | 96 | 120 | 144 | 180
D | 18 | 12 | 22 | 15 | 216 | 264 | 270 | 330
E | 8 | 19 | 11 | 22 | 152 | 209 | 176 | 242
Total | | | | | 669 | 869 | 1,055 | 1,355

$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} = \sqrt{\frac{869}{669} \times \frac{1355}{1055}} = 1.2916$

$P_{10} = \sqrt{\frac{\sum p_0 q_1}{\sum p_1 q_1} \times \frac{\sum p_0 q_0}{\sum p_1 q_0}} = \sqrt{\frac{1055}{1355} \times \frac{669}{869}} = 0.7742$

$P_{01} \times P_{10} = 1$
Therefore, Fisher's index number satisfies the time-reversal test.
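The verification can also be done numerically. The following is a minimal sketch with the data of Example 14.15; the function signatures (base prices and quantities first, then current ones) are an illustrative convention, not from the text. Laspeyres' index is included for contrast.

```python
from math import sqrt, isclose

# Data from Example 14.15.
p0 = [10, 15, 12, 18, 8];  q0 = [10, 7, 8, 12, 19]    # base year
p1 = [15, 18, 15, 22, 11]; q1 = [15, 21, 12, 15, 22]  # current year

def agg(prices, quantities):
    return sum(p * q for p, q in zip(prices, quantities))

def laspeyres(pa, qa, pb, qb):
    """Index of period b on base a (as a ratio), base-period weights."""
    return agg(pb, qa) / agg(pa, qa)

def fisher(pa, qa, pb, qb):
    """Fisher's ideal index of period b on base a (as a ratio)."""
    return sqrt(agg(pb, qa) / agg(pa, qa) * agg(pb, qb) / agg(pa, qb))

forward = fisher(p0, q0, p1, q1)     # P01
backward = fisher(p1, q1, p0, q0)    # P10: base and current year interchanged
print(round(forward, 4), round(backward, 4), round(forward * backward, 4))

# Laspeyres' index does not pass the same test on these data.
print(round(laspeyres(p0, q0, p1, q1) * laspeyres(p1, q1, p0, q0), 4))
```

Interchanging the two periods is done simply by swapping the argument pairs, which mirrors the definition of the test.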

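Example 14.16

Calculate the price index and quantity index for the following data by Fisher's ideal formula and verify that it satisfies the factor reversal test.

Item | Base Year Price | Base Year Quantity | Current Year Price | Current Year Quantity
A | 10 | 10 | 15 | 15
B | 15 | 7 | 18 | 21
C | 12 | 8 | 15 | 12
D | 18 | 12 | 22 | 15
E | 8 | 19 | 11 | 22

Solution:
Index number by Fisher's ideal index method

Items | p0 | q0 | p1 | q1 | p0q0 | p1q0 | p0q1 | p1q1
A | 10 | 10 | 15 | 15 | 100 | 150 | 150 | 225
B | 15 | 7 | 18 | 21 | 105 | 126 | 315 | 378
C | 12 | 8 | 15 | 12 | 96 | 120 | 144 | 180
D | 18 | 12 | 22 | 15 | 216 | 264 | 270 | 330
E | 8 | 19 | 11 | 22 | 152 | 209 | 176 | 242
Total | | | | | 669 | 869 | 1,055 | 1,355

$P_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}}$

$Q_{01} = \sqrt{\frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$

$P_{01} \times Q_{01} = \sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1} \times \frac{\sum q_1 p_0}{\sum q_0 p_0} \times \frac{\sum q_1 p_1}{\sum q_0 p_1}}$

$P_{01} \times Q_{01} = \sqrt{\frac{869}{669} \times \frac{1355}{1055} \times \frac{1055}{669} \times \frac{1355}{869}}$

$P_{01} \times Q_{01} = \sqrt{\left(\frac{1355}{669}\right)^2} = \frac{1355}{669}$

$P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$

Hence, Fisher's ideal index number satisfies the factor reversal test.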
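A numerical check of the factor reversal test, as a sketch using the column totals of Example 14.16. The variable names are illustrative; the Laspeyres product is computed only to show a formula that does not reproduce the value ratio.

```python
from math import sqrt, isclose

# Column totals from Example 14.16.
sum_p0q0, sum_p1q0, sum_p0q1, sum_p1q1 = 669, 869, 1055, 1355

value_ratio = sum_p1q1 / sum_p0q0   # true value ratio, 1355 / 669

# Fisher's price and quantity indices (as ratios).
fisher_price = sqrt(sum_p1q0 / sum_p0q0 * sum_p1q1 / sum_p0q1)
fisher_quantity = sqrt(sum_p0q1 / sum_p0q0 * sum_p1q1 / sum_p1q0)

# Laspeyres' price and quantity indices (as ratios), for contrast.
laspeyres_price = sum_p1q0 / sum_p0q0
laspeyres_quantity = sum_p0q1 / sum_p0q0

print(round(fisher_price * fisher_quantity, 4), round(value_ratio, 4))
print(round(laspeyres_price * laspeyres_quantity, 4))
```

The quantity index is obtained from the price index by interchanging p and q, so $\sum p_1 q_0$ becomes $\sum q_1 p_0$, which is the total labelled p0q1 in the table.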

14.8 INDEX NUMBERS IN INDIA

Below we discuss some of the popular indices used in India:

1) Consumer Price Index (CPI)
2) NIFTY 50
3) S&P BSE Sensex

Consumer Price Index Number (CPI)

The Consumer Price Index, commonly known as CPI, is an economic indicator. It is a tool that examines the effect of changes in prices for a basket of consumer goods and services, such as transportation, food, medical care, etc., and is used as a measure of inflation. CPI is a weighted average expressed as a percentage and is estimated by taking into account the changes in prices of each item in a predetermined basket of goods. CPI aims to compare a consistent basket of products from year to year, focusing on the products that are bought and used by consumers daily. It is one of the most used tools of statistics and helps in identifying periods of inflation or deflation.
Changes in the prices of commodities affect the cost of living of different groups of the population in different ways, because consumption patterns differ across groups of society. The general index number fails to take this into account.
Uses of the Consumer Price Index (CPI)

1. CPI is used to calculate wage and dearness allowance adjustments in many countries.

2. Governments use CPI in framing wage policy, price policy, rent control, taxation and general economic policies.

3. By creating awareness of price changes in the economy, it acts as a guide for making informed decisions about the economy and budgetary provisions.

4. CPI is also used for studying market prices for particular kinds of goods and services.

Methods of constructing the consumer price index

There are two methods of constructing the consumer price index. They are:

1. Aggregate Expenditure Method (or) Aggregate Method
2. Family Budget Method (or) Method of Weighted Relatives

1. Aggregate Expenditure Method

This method is based on Laspeyres' method and is widely used. The quantities of commodities consumed by a particular group in the base year are the weights.

Consumer price index number = $\frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$

2. Family Budget Method (or) Method of Weighted Relatives

This method estimates the weighted average by taking the aggregate expenditure of an average family on various items. It is given by

Consumer price index number = $\frac{\sum wP}{\sum w}$

where P = (p1 ÷ p0) × 100 for each item and w = p0q0.
The family budget method is the same as the "weighted average of price relatives method" studied earlier.

Example 14.17
Calculate the consumer price index number for 2010 based on 2000 from the following data by using (i) the aggregate expenditure method and (ii) the family budget (or) weighted relatives method.

Commodity | Quantity | Price in 2000 | Price in 2010
Oil | 2 | 80 | 120
Rice | 10 | 25 | 35
Wheat | 20 | 20 | 40
Sugar | 8 | 30 | 32

Solution

Commodity | q0 | p0 | p1 | p0q0 | p1q0
Oil | 2 | 80 | 120 | 160 | 240
Rice | 10 | 25 | 35 | 250 | 350
Wheat | 20 | 20 | 40 | 400 | 800
Sugar | 8 | 30 | 32 | 240 | 256
Total | | | | 1,050 | 1,646

(i) Calculation of the consumer price index number based on the aggregate expenditure method.

Consumer price index number = $\frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$ = (1,646 ÷ 1,050) × 100

= 156.76, the consumer price index number for 2010.

(ii) Calculation of the consumer price index number according to the family budget method or weighted relatives method.

Commodity | q0 | p0 | p1 | w = p0 × q0 | P = (p1 ÷ p0) × 100 | wP
Oil | 2 | 80 | 120 | 160 | 150 | 24,000
Rice | 10 | 25 | 35 | 250 | 140 | 35,000
Wheat | 20 | 20 | 40 | 400 | 200 | 80,000
Sugar | 8 | 30 | 32 | 240 | 106.67 | 25,600
Total | | | | 1,050 | | 1,64,600

Consumer price index number for 2010:

Consumer price index number = $\frac{\sum wP}{\sum w}$ = 1,64,600 ÷ 1,050 = 156.76
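Both methods can be computed side by side. This is a minimal sketch with the data of Example 14.17; the variable names are illustrative and not from the text.

```python
# Data from Example 14.17 (base year 2000, current year 2010).
items = ["Oil", "Rice", "Wheat", "Sugar"]
q0 = [2, 10, 20, 8]
p0 = [80, 25, 20, 30]
p1 = [120, 35, 40, 32]

# (i) Aggregate expenditure method: Laspeyres-type ratio of aggregates.
aggregate_cpi = (
    sum(p * q for p, q in zip(p1, q0)) / sum(p * q for p, q in zip(p0, q0)) * 100
)

# (ii) Family budget method: weighted mean of price relatives, w = p0 * q0.
weights = [p * q for p, q in zip(p0, q0)]
relatives = [b / a * 100 for a, b in zip(p0, p1)]
family_budget_cpi = sum(w * r for w, r in zip(weights, relatives)) / sum(weights)

print(round(aggregate_cpi, 2), round(family_budget_cpi, 2))   # 156.76 156.76
```

The two results coincide because w × P = p0q0 × (p1 ÷ p0) × 100 = p1q0 × 100, so the family budget formula reduces term by term to the aggregate expenditure formula.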

NIFTY 50

The NIFTY 50 Index is the popular stock market index of the National Stock Exchange (NSE) of India. It represents the weighted average of 50 of the largest Indian companies listed on the NSE. It is one of the two main stock indices used in India, the other being the BSE SENSEX. NIFTY was launched in 1996 and is owned and managed by NSE Indices (previously known as India Index Services and Products Limited), a wholly owned subsidiary of the NSE Strategic Investment Corporation Limited. NIFTY 50 covers 14 sectors (as of 20 June 2020) of the Indian economy and offers investment managers exposure to the Indian market in one portfolio.
The NIFTY 50 Index gives a weightage of 39.47% to the financial sector, 15.31% to energy, 13.01% to IT, 12.38% to consumer goods, 6.11% to automobiles and 0% to the agricultural sector.

The base value of the index has been set at 1,000, with a base date of November 3, 1995. It is a free-float market capitalization-weighted index, i.e. a float factor is assigned to each stock to account for the proportion of outstanding shares that are held by the general public, as opposed to closely held shares owned by government, royalty or company insiders.

S&P BSE Sensex

The S&P BSE SENSEX (S&P Bombay Stock Exchange Sensitive Index) is one of the globally renowned stock market indices. It is a free-float market-weighted stock market index of 30 well-established and financially sound companies. The 30 companies are selected to be representative of the various industrial sectors of the Indian economy. It was first compiled in 1986, with the base value taken as 100 for the base year 1978-79.
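The idea of a free-float market capitalization-weighted index can be sketched as follows. The two stocks, their prices, share counts and float factors are entirely hypothetical, and the calculation is a simplification: it only scales the ratio of current to base free-float capitalization by the base value, and ignores the adjustments that real index providers make for corporate actions and changes in constituents.

```python
# Hypothetical stocks: (price, shares outstanding, free-float factor).
# These figures are invented for illustration and are not real market data.
base_period = [(100.0, 1_000_000, 0.50), (250.0, 400_000, 0.80)]
current_period = [(120.0, 1_000_000, 0.50), (240.0, 400_000, 0.80)]

BASE_VALUE = 1000.0   # base value as stated above for NIFTY 50 (Sensex uses 100)

def free_float_cap(stocks):
    """Market capitalization counting only publicly held shares."""
    return sum(price * shares * factor for price, shares, factor in stocks)

index_level = free_float_cap(current_period) / free_float_cap(base_period) * BASE_VALUE
print(round(index_level, 2))
```

In this sketch the first stock has more shares outstanding but only half of them are freely traded, so its price change counts for less than it would under a full market capitalization weighting.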

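14.9 LET US SUM UP

1. Price relative in period n: $P_{0n} = \frac{p_n}{p_0} \times 100$

Quantity relative in period n: $Q_{0n} = \frac{q_n}{q_0} \times 100$

Value relative in period n: $V_{0n} = \frac{p_n q_n}{p_0 q_0} \times 100$

2. Unweighted aggregate price index in period n

$P_{0n} = \frac{\sum p_n}{\sum p_0} \times 100$

The simple average of price relatives

$P_{0n} = \frac{\sum \left(\frac{p_n}{p_0} \times 100\right)}{N}$

Simple geometric mean of price relatives

$P_{0n} = \text{antilog}\left[\frac{\sum \log\left(\frac{p_n}{p_0} \times 100\right)}{N}\right]$

The simple aggregate quantity index

$Q_{0n} = \frac{\sum q_n}{\sum q_0} \times 100$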
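3. Weighted aggregate price index

(a) The weighted aggregate method in period n

$P_{0n} = \frac{\sum p_n q}{\sum p_0 q} \times 100$

Laspeyres' Price Index = $\frac{\sum p_1 q_0}{\sum p_0 q_0} \times 100$

Paasche's Price Index = $\frac{\sum p_1 q_1}{\sum p_0 q_1} \times 100$

Dorbish and Bowley's Price Index = $\frac{1}{2}\left[\frac{\sum p_1 q_0}{\sum p_0 q_0} + \frac{\sum p_1 q_1}{\sum p_0 q_1}\right] \times 100$

Fisher's Ideal Price Index = $\sqrt{\frac{\sum p_1 q_0}{\sum p_0 q_0} \times \frac{\sum p_1 q_1}{\sum p_0 q_1}} \times 100$

Marshall-Edgeworth Price Index = $\frac{\sum p_1 q_0 + \sum p_1 q_1}{\sum p_0 q_0 + \sum p_0 q_1} \times 100$

Kelly's Price Index = $\frac{\sum p_1 q}{\sum p_0 q} \times 100$

Walsh's Price Index = $\frac{\sum p_1 \sqrt{q_0 q_1}}{\sum p_0 \sqrt{q_0 q_1}} \times 100$

(b) Weighted average of price relatives in period n

$P_{0n} = \frac{\sum wP}{\sum w}$, where $w = p_0 q_0$ and $P = \frac{p_n}{p_0} \times 100$

The weighted average of price relatives using the geometric mean:

$P_{0n} = \text{antilog}\left[\frac{\sum w \log P}{\sum w}\right]$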

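4. Quantity index
(a) Unweighted quantity index for period n

$Q_{0n} = \frac{\sum q_n}{\sum q_0} \times 100$

The simple average of quantity relatives

$Q_{0n} = \frac{\sum \left(\frac{q_n}{q_0} \times 100\right)}{N}$

(b) Weighted quantity index

$Q_{0n} = \frac{\sum q_n w}{\sum q_0 w} \times 100$

5. Tests for consistency

Time reversal test: $P_{01} \times P_{10} = 1$

Factor reversal test: $P_{01} \times Q_{01} = \frac{\sum p_1 q_1}{\sum p_0 q_0}$

Circular test: $P_{01} \times P_{12} \times P_{23} \times \dots \times P_{(n-1)n} \times P_{n0} = 1$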

14.10 KEYWORDS

Index Number: A ratio that measures the change in a variable over a period of time.

Base period: The reference period against which comparisons are made.

Consumer Price Index Number: The average change over time in the prices paid by consumers for specified goods and services, sometimes referred to as the Cost of Living Index.

Fixed Weight Aggregate Method: Quantities consumed in a specific period are used as weights to calculate the aggregate index.

Laspeyres' Method: To measure the aggregate index, this method uses quantities consumed in the base period as weights.

Paasche's Method: To measure the aggregate index, this method uses quantities consumed in the current period as weights.

Percentage Relative: A ratio of the current value to the base value, multiplied by 100.

Price Index: It is used to compare the changes in the prices of commodities from one period to another.

Quantity Index: It is used to study the change in the quantity of commodities consumed from one period to another.

Value Index: A measure used to study the changes in total monetary worth over time.

Unweighted Aggregative Price Index: All the values are assigned equal importance to estimate the changes in prices, over time, for an entire group of commodities.

Unweighted average of price relatives: The average of the price relatives for all items. The average could be the arithmetic mean or the geometric mean.

Weighted aggregate index: An aggregate of items that have been weighted in some way, by the corresponding quantities produced, consumed, or sold, to reflect their importance.

Weighted average of relatives method: The average is estimated by multiplying each price relative by its weight, where the weight is the value (price × quantity) consumed in the base period.

14.11 SELF-ASSESSMENT QUESTIONS

14.11 The following data show the monthly rent of a 2BHK house in different locations over three years. Calculate the simple aggregate price index numbers for the years 2006 and 2007 using 2005 as the base year.

Area House Rent

2005 2006 2007

A 15,000 16,000 20,000

B 17,000 20,000 22,000

Cc 18,500 21,000 25,000

D 10,500 12,000 15,000

14.12 The data below describe the average salary of the employees in a company over the past 10 consecutive years. Calculate an index for these averages using year 4 as the base year. Calculate the percentage point change between consecutive years.

Year Salary

1 9,800

2 10,200

3 11,000

4 12,400

5 13,100

6 14,100

7 14,900

8 15,700

9 16,400

10 17,800

14.13 Below are the prices for different commodities for the years 2008 and 2009.
Calculate the price index based on price relatives using the geometric mean.

Items 2008 2009

A 43 50

B 48 55

c 32 5

D 61 64

E 40 43

F 52 55

14.14 The prices for four consecutive years for a men's clothing brand are given below. Calculate an unweighted average of price relatives index for each year using 2000 as the base year.

Products 2000 2001 2002 2003

Trousers 1,500 1,760 1,700 1,900

Jeans 2,000 2,000 2,200 2,300

Shirts 1,000 1,200 1,200 1,300

14.15 Calculate the weighted average of relatives quantity indices, using the prices and quantities from 1995 to compute value weights, with 1995 as the base year.

Model | Numbers sold (in lakhs) 1995 | Numbers sold (in lakhs) 1996 | Numbers sold (in lakhs) 1997 | Price in 1995
Sedan | 45 | 48 | 56 | 13.9
Hatchback | 64 | 67 | 71 | 8.3
SUV | 28 | 35 | 27 | 23.8
MUV | 21 | 16 | 28 | 15.7

14.16 A book publishing house wants to know whether sales have changed after the release of the first edition of its books. Using 2011 as the base, calculate the unweighted aggregate quantity index for 2012 and 2013.

Books 2011 2012 2013

English 11 8 15

Maths 27 26 30

Science 10 26 32

Social Studies 24 18 26

physics 16 20 21

Chemistry 19 15 22

Biology 32 37 35

Economics 48 53 50

14.17 The data given below are the prices and respective quantities sold by a farmer for crops grown in past years. Construct the Laspeyres', Paasche's, Dorbish and Bowley's, and Marshall-Edgeworth indices.

2010 2011

Crops Prices | Quantities) Prices Quantities

wheat 30 120 40 125

potato 10 90 12 105

corn 20 50 29 60

maize 24 130 28 150

14.18 The data given below are the numbers of individuals who have taken personal health insurance. Calculate the unweighted average of relatives index for each year.

Profession 1997 1998 1999 2000

Doctors 54 65 86 103

Fireman 39 41 55 76

Policeman 48 61 76 93

Teachers 46 58 75 96

14.19 In 1998, the average monthly wage for teachers was Rs 1,42,600. In 2002 the average monthly wage for the same group was Rs 1,52,800. The consumer price index in 2002 using 1998 as the base period was 148. Calculate the real average monthly wage for this group in 1998.

14.20 A local jam manufacturing company feels that the sales of its four best-selling flavours are changing. The data for the years 2000 through 2004 are given below. Calculate the fixed-weight aggregate index for each year using 2000 prices as the base and the 2004 quantities as fixed weights.

Price per unit Price per unit

Flavours 2000 | 2001 | 2002 | 2003 | 2000 | 2001 | 2002 | 2003

Orange 58 62 | 69 79 21 25 20 18

Mixed Fruit} 189 | 209 | 218 | 225 15 12 |) 18 21

Pine apple | 84 89 | 99 99 29 27 | 23 24

Apple 91 | 99 | 114 | 119 | 31 | 24 | 20 | 16


Published by: Institute of Management Technology, Centre for Distance Learning, Ghaziabad
ISBN: 978-81-951960-6-7
Printed by: Utility Forms Pvt. Ltd., New Delhi
