Basic Statist final.-2

The Basic Statistics module at Jimma University aims to provide students with a foundational understanding of statistical techniques relevant to business decision-making. It covers key topics such as data collection, measures of central tendency, dispersion, and probability theory, structured across four chapters. The course emphasizes the importance of statistical knowledge in analyzing and interpreting data to solve business problems effectively.

Basic Statistics

JIMMA UNIVERSITY

COLLEGE OF NATURAL SCIENCES

DEPARTMENT OF STATISTICS

BASIC STATISTICS MODULE

Prepared By:

Mr. Abiy Disasa (MSc in Biostatistics)
Mr. Reta Habtamu (MSc in Mathematical and Statistical Modeling)
Mr. Samuel Fikadu (MSc in Biostatistics)

Edited By:

Mr. Abiyot Negash (MSc in Biostatistics)

August 02, 2021

Jimma, Ethiopia


Module Introduction

Dear students, first and foremost, you are warmly welcome to the course ‘Basic Statistics’.
This course provides you with a general understanding of the statistical techniques
commonly used in solving business problems and undertaking business research. Topics
include frequency distributions, measures of central tendency, dispersion, and probability
theory.

Quite understandably, as a systematized set of activities and functions, business involves the
application of both qualitative and quantitative techniques. Relating to our purpose, statistical
techniques are gaining prominence in the face of changing technologies and growing complexity
in business and industry. These techniques are now considered effective tools for
solving business problems, in addition to constituting an important segment of the study of
business in general. They can, however, never be substituted for human skills, experience, and
judgment. Many activities that were previously handled by verbal analysis and description have
proved to be more easily dealt with using statistical techniques. The use of statistical tools can
give clarity and certainty in handling problems and enforce precision in stating the facts of a
situation where these would otherwise be lost in emotion and argument. The use of statistical
knowledge in business dates back many years. In recent years, the importance of
understanding statistical methods and techniques, and of having the skills to use them, has
been recognized more widely than before. It is essential for anyone making business decisions
on the basis of data to possess a clear understanding of statistics. Among other factors, the vast
and fast-changing technological, financial and economic setting has necessitated an organized
and extensive application of statistical tools to business decision making. Statistics has proved
useful in many ways, such as in establishing relationships, making predictions, and providing
solutions to the many problems of business operations and managerial decisions. Statistics is
widely applied in production and quality control, marketing research, manpower planning,
finance, etc.

The objective of this module is to enable students to apply basic statistical techniques and
methods for grouping, tabular and graphical display, analysis and interpretation of statistical data.

This course aims to:

 Demonstrate understanding of key statistical terms
 Provide a working knowledge of the statistical tools used in business
 Introduce techniques of gathering, organizing, presenting and interpreting statistical data
 Describe a set of grouped and ungrouped data by measuring the central tendency and the
variability of the data
 Enable students to apply the concept of probability to quantify uncertainty and assess business
risk

Besides, it also aims to create know-how among students on the various application areas and
benefits of statistics in business. The module is therefore designed to address your need to be
acquainted with the basic concepts and applications of statistics in executing business
activities. Accordingly, this module comprises four chapters with their respective sections
and subsections. The first chapter introduces statistics and its basic concepts. The second
chapter covers the visual description of data. The third chapter discusses the
statistical description of data. The fourth chapter covers probability and probability distributions.
To facilitate your study, all sections and subsections in each chapter include
examples with detailed explanations of the steps and procedures used in solving them. In addition,
you will find a chapter summary and review problems at the end. The module also provides
you with self-assessment questions and exercises.


Table of Contents

Module Introduction

CHAPTER ONE: INTRODUCTION

1.1. WHAT IS STATISTICS?
1.1.1. Meaning and definitions of statistics
1.2. DESCRIPTIVE VERSUS INFERENTIAL STATISTICS
1.3. STAGES IN STATISTICAL INVESTIGATION
1.4. SOME BASIC DEFINITIONS
1.5. TYPES OF VARIABLES, DATA AND SCALE OF MEASUREMENTS
1.5.1. Types of variables
1.6. TYPES AND SOURCE OF DATA
1.7. MEASUREMENT SCALE AND SCALE TYPES
1.8. TYPES OF MEASUREMENT SCALES
1.9. STATISTICS AND ITS IMPORTANCE IN BUSINESS DECISIONS
1.10. SCOPE OF STATISTICS
1.11. LIMITATIONS OF STATISTICS
1.12. IMPORTANCE OF STATISTICS IN BUSINESS

CHAPTER TWO: VISUAL DESCRIPTION OF DATA

2.1. DATA TYPE AND METHODS OF DATA COLLECTION
2.2. METHODS OF DATA COLLECTION
2.3. METHOD OF DATA PRESENTATION
2.4. RELATIVE FREQUENCY DISTRIBUTION
2.5. THE STEM-AND-LEAF DISPLAY AND THE DOTPLOT
2.5.1. Stem and leaf plot
2.6. OTHER METHODS FOR VISUAL REPRESENTATION OF THE DATA
2.6.1. Diagrammatic presentation of data
2.3.2. Graphical presentation
2.7. TABULATION AND CONTINGENCY TABLES

CHAPTER THREE: STATISTICAL DESCRIPTION OF DATA

3.1. INTRODUCTION AND OBJECTIVES
3.2. THE SUMMATION NOTATION
3.3. TYPES OF MEASURES OF CENTRAL TENDENCY
3.3.1. Mean
3.3.2. Mode
3.3.3. The Median and Other measures of Locations
3.4. OTHER MEASURES OF LOCATION (QUANTILES)
3.5. STATISTICAL DESCRIPTION: MEASURES OF DISPERSION
3.5.1. Introduction and Objectives of Measuring Dispersion
3.6. MOMENTS, SKEWNESS AND KURTOSIS
3.7. STATISTICAL MEASURES OF ASSOCIATION

CHAPTER FOUR: PROBABILITY AND PROBABILITY DISTRIBUTION

4.1. BASIC DEFINITIONS OF PROBABILITY
4.2. FUNDAMENTAL CONCEPTS: EXPERIMENT AND EVENT, EVENT AND THEIR RELATIONSHIPS,
CONDITIONAL AND JOINT PROBABILITY
4.3. REVIEW OF SET THEORY
4.4. COUNTING RULES
4.5. APPROACHES TO MEASURING PROBABILITY
4.6. CONDITIONAL PROBABILITY AND INDEPENDENCY
4.7. THEOREM OF TOTAL PROBABILITY AND BAYES’ THEOREM
4.8. PROBABILITY DISTRIBUTION
4.8.1. Definition of random variables and probability distributions
4.9. MEAN, VARIANCE AND EXPECTATION OF R.V
4.10. COMMON DISCRETE PROBABILITY DISTRIBUTIONS
4.11. COMMON CONTINUOUS PROBABILITY DISTRIBUTIONS

ASSIGNMENT QUESTION (LOAD: 30%)


CHAPTER ONE: INTRODUCTION

Dear students, this chapter introduces you to the subject of statistics. It provides a
general understanding of statistics and statistical terms in business, the branches of statistics,
the stages of the statistical investigation process, types of variables, sources of data,
measurement scales and scale types, the scope of statistics, and the importance of statistics in
business.

In general, this chapter aims to:

 Demonstrate understanding of key statistical terms
 Provide a working knowledge of the statistical tools used in business
 Introduce techniques of gathering, organizing, presenting and interpreting statistical data
 Introduce the scales of measurement, because the scale used for a variable determines which
statistics are appropriate for analyzing the data

After reading this chapter, you should be able to:

 Define and briefly explain statistics, its scope, its importance, its limitations and its
branches as they relate to business
 Elaborate on measurement scales and scale types
 Understand the different stages of the statistical investigation process
 List methods of data collection
 Explain the difference between methods of data collection and research methods
 Define and explain the characteristics of each method of data collection
 Explain the different modes of administration of the methods of data collection
 Explain the key characteristics of different types of interviews


1.1. What is statistics?

For a layman, ‘Statistics’ means numerical information expressed in quantitative terms. This
information may relate to objects, subjects, activities, phenomena, or regions of space. As a
matter of fact, data have no limits as to their reference, coverage, and scope. At the macro
level, these are data on gross national product and shares of agriculture, manufacturing, and
services in GDP (Gross Domestic Product).

At the micro level, individual firms, however small or large, produce extensive statistics on
their operations. The annual reports of companies contain a variety of data on sales, production,
expenditure, inventories, capital employed, and other activities. These data are often field
data, collected by employing scientific survey techniques. Unless regularly updated, such data
are the product of a one-time effort and have limited use beyond the situation that
called for their collection. A student knows statistics more intimately as a subject of study, like
economics, mathematics, chemistry, physics, and others. It is a discipline that scientifically
deals with data, and is often described as the science of data. In dealing with data,
statistics has developed appropriate methods of collecting, presenting, summarizing, and
analyzing data, and thus consists of a body of these methods.

1.1.1. Meaning and definitions of statistics

To begin with, it may be noted that the word ‘statistics’ is used in two
senses: plural and singular. In the plural sense, it refers to a set of figures or data. In the
singular sense, statistics refers to the whole body of tools that are used to collect data,
organize and interpret them and, finally, to draw conclusions from them. It should be noted
that both aspects of statistics are important if quantitative data are to serve their
purpose. If statistics, as a subject, is inadequate and consists of poor methodology, we cannot
know the right procedure for extracting from the data the information they contain. Similarly,
if our data are defective, inadequate or inaccurate, we cannot reach the right
conclusions even though our subject is well developed.

A.L. Bowley defined statistics in three ways: (i) statistics is the science of counting, (ii) statistics
may rightly be called the science of averages, and (iii) statistics is the science of the measurement
of the social organism regarded as a whole in all its manifestations. Boddington defined it as
follows: statistics is the science of estimates and probabilities. Further, W.I. King defined
statistics in a wider context: the science of statistics is the method of judging collective,
natural or social phenomena from the results obtained by the analysis or enumeration or
collection of estimates.

Seligman stated that statistics is a science that deals with the methods of collecting,
classifying, presenting, comparing and interpreting numerical data collected to throw some
light on any sphere of enquiry. Spiegel defines statistics by highlighting its role in
decision-making, particularly under uncertainty, as follows: statistics is concerned with scientific
methods for collecting, organizing, summarizing, presenting and analyzing data, as well as
drawing valid conclusions and making reasonable decisions on the basis of such analysis.
According to Prof. Horace Secrist, statistics is the aggregate of facts, affected to a marked
extent by a multiplicity of causes, numerically expressed, enumerated or estimated according to
reasonable standards of accuracy, collected in a systematic manner for a pre-determined
purpose, and placed in relation to each other.

From the above definitions, we can highlight the major characteristics of statistics as follows:

(i) Statistics are aggregates of facts. A single figure is not statistics. For example, the national
income of a country for a single year is not statistics, but the same figure for two or more years
is statistics.
(ii) Statistics are affected by a number of factors. For example, the sale of a product depends on a
number of factors such as its price, quality, competition, the income of the consumers, and so
on.
(iii) Statistics must be reasonably accurate. Wrong figures, if analyzed, will lead to erroneous
conclusions. Hence, conclusions must be based on accurate figures.
(iv) Statistics must be collected in a systematic manner. If data are collected in a haphazard
manner, they will not be reliable and will lead to misleading conclusions.
(v) Statistics must be collected for a pre-determined purpose.
(vi) Lastly, statistics should be placed in relation to each other. If one collects data unrelated to
each other, then such data will be confusing and will not lead to any logical conclusions. Data
should be comparable over time and over space.


1.2. Descriptive versus Inferential statistics

There are two major divisions of statistics: descriptive statistics and inferential
statistics. Descriptive statistics deals with collecting, summarizing, and simplifying data,
which are otherwise quite unwieldy and voluminous. It seeks to achieve this in a manner such that
meaningful conclusions can be readily drawn from the data. Descriptive statistics may thus be
seen as comprising methods of bringing out and highlighting the latent characteristics present
in a set of numerical data. It not only facilitates an understanding of the data and systematic
reporting thereof, but also makes them amenable to further discussion, analysis,
and interpretation.

The first step in any scientific inquiry is to collect data relevant to the problem in hand. When
the inquiry relates to the physical and/or biological sciences, data collection is normally an
integral part of the experiment itself. In fact, the very manner in which an experiment is
designed determines the kind of data it will require and/or generate. The problem of
identifying the nature and the kind of the relevant data is thus automatically resolved as soon
as the design of the experiment is finalized. This is possible in the case of the physical sciences.
In the case of the social sciences, where the required data are often collected through a
questionnaire from a number of carefully selected respondents, the problem is not so simply
resolved. For one thing, designing the questionnaire itself is a critical initial problem. For
another, the number of respondents to be accessed for data collection and the criteria for
selecting them have their own implications and importance for the quality of the results obtained.
Further, once the data have been collected, they are assembled, organized, and presented in the
form of appropriate tables to make them readable. Wherever needed, figures, diagrams, charts,
and graphs are also used for better presentation of the data. A useful tabular and graphic
presentation of data requires that the raw data be properly classified in accordance with the
objectives of the investigation and the relational analysis to be carried out.

A well thought-out and sharp data classification facilitates easy description of the hidden data
characteristics by means of a variety of summary measures. These include measures of central
tendency, dispersion, skewness, and kurtosis, which constitute the essential scope of
descriptive statistics. These form a large part of the subject matter of any basic textbook on
the subject, and thus they are being discussed in that order here as well.
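
As a rough illustration of these summary measures, the following Python sketch computes them for a small data set. The sales figures are hypothetical, and the skewness and kurtosis shown are the simple moment-based versions, which is one of several conventions and an assumption here rather than the module's stated formula.

```python
import statistics

# Hypothetical daily sales figures (units sold); illustrative data only.
sales = [12, 15, 11, 18, 15, 14, 20, 15, 13, 17]
n = len(sales)

# Measures of central tendency
mean = statistics.mean(sales)
median = statistics.median(sales)
mode = statistics.mode(sales)

# Measure of dispersion: population standard deviation
std_dev = statistics.pstdev(sales)

# Central moments about the mean (assumed moment-based definitions)
m2 = sum((x - mean) ** 2 for x in sales) / n
m3 = sum((x - mean) ** 3 for x in sales) / n
m4 = sum((x - mean) ** 4 for x in sales) / n

skewness = m3 / m2 ** 1.5   # positive value: longer right tail
kurtosis = m4 / m2 ** 2     # about 3 for a normal distribution

print(f"mean={mean}, median={median}, mode={mode}")
print(f"std dev={std_dev:.3f}, skewness={skewness:.3f}, kurtosis={kurtosis:.3f}")
```

Each number condenses one aspect of the data: where it is centered, how widely it is spread, and how asymmetric or peaked its distribution is.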

Inferential statistics, also known as inductive statistics, goes beyond describing a given
problem situation by means of collecting, summarizing, and meaningfully presenting the
related data. Instead, it consists of methods that are used for drawing inferences, or making
broad generalizations, about a totality of observations on the basis of knowledge about a part
of that totality. The totality of observations about which an inference may be drawn, or a
generalization made, is called a population or a universe. The part of totality, which is
observed for data collection and analysis to gain knowledge about the population, is called a
sample.

The desired information about a given population of interest may also be collected
by observing all the units comprising the population. This total coverage is called a census.
Getting the desired value for the population through census is not always feasible and
practical for various reasons. Apart from time and money considerations making the census
operations prohibitive, observing each individual unit of the population with reference to any
data characteristic may at times involve even destructive testing. In such cases, obviously, the
only recourse available is to employ the partial or incomplete information gathered through a
sample for the purpose. This is precisely what inferential statistics does. Thus, obtaining a
particular value from the sample information and using it for drawing an inference about the
entire population underlies the subject matter of inferential statistics.

Consider a situation in which one is required to know the average body weight of all the
college students in a given cosmopolitan city during a certain year. A quick and easy way to
do this is to record the weight of only 500 students, from out of a total strength of, say, 10000,
or an unknown total strength, take the average, and use this average based on incomplete
weight data to represent the average body weight of all the college students. In a different
situation, one may have to repeat this exercise for some future year and use the quick estimate
of average body weight for a comparison. This may be needed, for example, to decide
whether the weight of the college students has undergone a significant change over the years
compared.
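
The body-weight example can be mimicked with a small simulation. This is only a sketch under stated assumptions: the population of 10,000 weights is generated artificially from a normal distribution with mean 60 kg and standard deviation 10 kg, figures that are invented for illustration and do not come from any real survey.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: body weights (kg) of 10,000 college students.
# The normal(60, 10) shape is an assumption made purely for illustration.
population = [random.gauss(60, 10) for _ in range(10_000)]

# Simple random sample of 500 students, drawn without replacement.
sample = random.sample(population, 500)

population_mean = statistics.mean(population)  # usually unknown in practice
sample_mean = statistics.mean(sample)          # the quick estimate

print(f"Population mean:  {population_mean:.2f} kg")
print(f"Sample mean:      {sample_mean:.2f} kg")
print(f"Estimation error: {sample_mean - population_mean:+.2f} kg")
```

In practice only the sample mean would be available; the population mean is computed here solely to show how close the estimate from 500 observations tends to be.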

Inferential statistics helps to evaluate the risks involved in reaching inferences or
generalizations about an unknown population on the basis of sample information. For
example, an inspection of a sample of five battery cells drawn from a given lot may reveal
that all five cells are in perfectly good condition. This information may be used to
conclude that the entire lot is good enough to buy.

Since this inference is based on the examination of a sample of a limited number of cells, it is
quite possible that not all the cells in the lot are in order. It is also possible that all the items
included in the sample are unsatisfactory. This may be used to conclude that the
entire lot is of unsatisfactory quality, whereas the fact may indeed be otherwise. It may, thus,
be noticed that there is always a risk of an inference about a population being incorrect when
based on the knowledge of a limited sample. The remedy in such situations lies in evaluating
such risks. For this, statistics provides the necessary methods. These center on quantifying in
probabilistic terms the chances of decisions taken on the basis of sample information being
incorrect. This requires an understanding of the what, why, and how of probability and
probability distributions to equip ourselves with methods of drawing statistical inferences and
estimating the degree of reliability of these inferences.
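
To make the risk in the battery-cell example concrete, the sketch below computes the chance that a sample of five cells shows no defect even though the lot contains defective cells. The lot size of 100 and the count of 10 defective cells are hypothetical figures, and `prob_all_good` is a helper name introduced only for this illustration.

```python
from math import comb

def prob_all_good(lot_size: int, defective: int, sample_size: int) -> float:
    """Probability that a random sample, drawn without replacement,
    contains no defective item."""
    good = lot_size - defective
    # Ways to choose the whole sample from the good items,
    # divided by ways to choose the sample from the entire lot.
    return comb(good, sample_size) / comb(lot_size, sample_size)

# Hypothetical lot: 100 cells, of which 10 are defective; sample of 5 cells.
p = prob_all_good(lot_size=100, defective=10, sample_size=5)
print(f"P(all 5 sampled cells are good) = {p:.4f}")
```

Under these assumed figures, a clean sample would be observed more than half the time even though one cell in ten is defective, which is exactly the kind of risk that inferential methods are meant to quantify.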

1.3. Stages in Statistical Investigation

There are five stages or steps in any statistical investigation.

1. Collection of data: The first and foremost step to be carried out in statistical analysis is
data collection. This is the step at which data about the problem to be investigated are
collected. Data are very important elements on which the result of the analysis depends. If
care is not taken during data collection, that is, if wrong data are collected, then the results
obtained from analyzing such data will also be wrong. These wrong results
will lead the analyst to wrong conclusions, which in turn will mislead decision makers.

 Data can be collected in a variety of ways; one of the most common methods is
the survey. Surveys can be conducted in different ways, three of the most
common being:
 Mailed (postal) questionnaire
 Personal interview
 Telephone interview

A. Mailed questionnaire

Advantages:
 Can cover a large number of people or organizations
 No prior arrangements are needed
 No interviewer bias

Disadvantages:
 Little opportunity to use visual aids
 Low response rate
 Can’t reach all types of people
 Not possible to give assistance if required

B. Personal interview

Data collection through oral conversations

Advantages:

 Serious approach by respondent, resulting in accurate information
 Good response rate
 Completed and immediate
 Interviewer is in control and can give help if there is a problem
 Can use recording equipment
 Characteristics of respondent can be assessed: tone of voice, facial expression, hesitation, etc.

Disadvantages:

 Need to set up interviews
 Time consuming
 Geographic limitations
 Can be expensive
 Normally need a set of questions
 Respondent bias: tendency to please or impress, create a false personal image, or end the
interview quickly
 Embarrassment possible if personal questions are asked
 Training is required

C. Telephone interview

Advantages:

 Quick
 Can cover reasonably large numbers of people or organizations
 Wide geographic coverage
 High response rate: calls can continue until the required number is reached
 No waiting
 Spontaneous response
 Help can be given to the respondent
 Can tape (record) answers

Disadvantages:

 Often connected with selling
 Questionnaire required
 Not everyone has a telephone
 Repeat calls are inevitable (expected)
 Time is wasted
 Straightforward questions are required
 Respondent has little time to think
 Cannot use visual aids
 Can cause irritation
 Good telephone manner is required

Observation

Watching people engaged in activities and recording what occurs.


Advantages

 Gives relatively more accurate data on behavior and activities.
 Collection of information on facts.

Disadvantages

 The investigator’s or observer’s own bias, prejudices, desires, etc.
 Needs more resources and skilled manpower when high-level machines are used.

Designing a Questionnaire

 Questions should be simple.
 Questions should be unambiguous and clear.
 The best kinds of questions are those which allow a pre-printed answer to be ticked.
 The questionnaire should be as short as possible.
 Questions should be neither irrelevant nor too personal.
 Leading questions should not be asked. A “leading question” is one that suggests the answer.
For example, the question “Don’t you agree that all sensible people use XYZ soap?” leads to a
“Yes” response.

Objectives in designing questionnaires

There are two main objectives in designing a questionnaire:

 To maximize the proportion of subjects answering our questionnaire, that is, the response rate.
 To obtain accurate and relevant information for our survey.

Types of questions

Closed ended questions

 Closed-ended questions are questions that can only be answered by selecting from a limited
number of options, usually multiple-choice, 'yes' or 'no', or a rating scale (e.g. from strongly
agree to strongly disagree).
 Closed-ended questions give limited insight, but can easily be analyzed for quantitative data.


Example

 Sex: Male [] Female []


 Did you watch television last night? Yes [] No []
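
Because closed-ended answers fall into a fixed set of options, they can be tallied directly. The short Python sketch below counts answers to the television question; the list of responses is made up purely for illustration.

```python
from collections import Counter

# Hypothetical answers to "Did you watch television last night?"
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

counts = Counter(responses)   # frequency of each answer option
total = len(responses)

# Report each option with its count and its share of all responses.
for answer, count in counts.most_common():
    print(f"{answer}: {count} ({count / total:.0%})")
```

The same tally applied to open questions would first require the answers to be summarized and coded into categories.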

Open questions

 It allows the respondent to elaborate upon an earlier, more specific question
 Often inserted at the end of major sections, or at the end of the questionnaire
 It should not be used to introduce a section, since there is a high risk of influencing later
responses
 It involves intensive summarization and possibly coding

Exercise: Discuss the advantages and disadvantages of the above three methods with respect to
each other.

2. Organization of data: Once data have been collected, the next step is to arrange the bulk
of data in a simple and understandable manner. This includes classifying the data according to
their resemblance, and arranging them in tabular form by putting records (data) in rows and
columns.
3. Presentation of the data: After data have been collected and organized, the next step is
presentation. That is, we can transform the data into charts or graphs, or present them
using frequency distributions.

4. Analysis of data: In this step, the presented data are investigated using
different statistical techniques. Among the different methods of analysing data, we
can mention simple descriptive analyses such as measures of central
tendency, measures of variation and so on.
5. Inference of data: The final step in conducting a statistical investigation is
interpreting the results obtained in the preceding steps. This includes giving appropriate
conclusions based on the values obtained from the data. That is, we need to transform the
numerical results computed from the gathered data into statements regarding the problem
under investigation, so that decision makers can easily understand them and make decisions
based on the conclusions drawn.

 Statistical techniques based on probability theory are required.
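
The stages listed above can be sketched on a tiny data set. In the following Python example the raw figures are hypothetical, and the text-based bar display is only a stand-in for the charts and graphs mentioned in the presentation stage.

```python
from collections import Counter
import statistics

# Stage 1 (collection): hypothetical number of items bought by 12 customers.
raw_data = [2, 3, 1, 2, 4, 2, 3, 1, 2, 5, 3, 2]

# Stage 2 (organization): classify equal values together in a frequency table.
frequency = Counter(raw_data)

# Stage 3 (presentation): show the table with a simple text bar per value.
print("Items | Frequency")
for value in sorted(frequency):
    print(f"{value:>5} | {frequency[value]:>9} {'*' * frequency[value]}")

# Stage 4 (analysis): a measure of central tendency and a measure of variation.
mean_items = statistics.mean(raw_data)
spread = statistics.pstdev(raw_data)

# Stage 5 (interpretation): turn the numbers into a statement for decision makers.
print(f"On average a customer bought {mean_items:.1f} items "
      f"(standard deviation {spread:.2f}).")
```

The final printed sentence is the kind of plain statement that the interpretation stage aims for.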

1.4. Some Basic Definitions

a. Statistical Population: It is the collection of all possible observations of a specified


characteristic of interest (possessing certain common property) and being under study. An
example is all of the students in AAU 3101 course in this term.
b. Sample: It is a subset of the population, selected using some sampling technique in such a
way that they represent the population.
c. Sampling: The process or method of sample selection from the population.
d. Sample size: The number of elements or observations to be included in the sample.
e. Census: Complete enumeration or observation of the elements of the population, that is, the
collection of data from every element in a population.
f. Parameter: Characteristic or measure obtained from a population.
g. Statistic: Characteristic or measure obtained from a sample.
h. Variable: It is an item of interest that can take on many different numerical values.
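
The difference between a parameter and a statistic can be seen in a short Python sketch. The population below is hypothetical (randomly generated ages for 200 students), and simple random sampling is assumed as the sampling technique.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: ages of all 200 students in a course
population = [random.randint(18, 30) for _ in range(200)]

# Parameter: a measure computed from the whole population (a census)
population_mean = mean(population)

# Sample: a subset chosen by simple random sampling without replacement
sample_size = 20
sample = random.sample(population, sample_size)

# Statistic: the same measure computed from the sample only
sample_mean = mean(sample)

print(f"Parameter (population mean): {population_mean:.2f}")
print(f"Statistic (sample mean, n={sample_size}): {sample_mean:.2f}")
```

The two means will usually be close but not identical, which is why a statistic is treated as an estimate of the corresponding parameter.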

1.5. Types of variables, Data and scale of measurements


1.5.1. Types of variables

Statistical data are the basic raw material of statistics. Data may relate to an activity of our
interest, a phenomenon, or a problem situation under study. They arise as a result of the
process of measuring, counting and/or observing. Statistical data, therefore, refer to those
aspects of a problem situation that can be measured, quantified, counted, or classified. Any
object, subject, phenomenon, or activity that generates data through this process is termed a
variable.

In other words, a variable is one that shows a degree of variability when successive
measurements are recorded. In statistics, variables are classified into two broad categories:
quantitative variables and qualitative variables. This classification is based on the kind of
characteristics that are measured.

i) Quantitative variables are those that can be quantified in definite units of measurement. These
refer to characteristics whose successive measurements yield quantifiable observations.

Depending on the nature of the variable observed for measurement, quantitative variables can
be further categorized as continuous and discrete variables.

a) Continuous variables: A continuous variable is one that can assume any value between any
two points on a line segment, thus representing an interval of values. The values can be very
close to each other, yet distinguishably different. Characteristics such as weight, length,
height, thickness, velocity, temperature and tensile strength represent continuous variables,
and the data recorded on these and similar characteristics are called continuous data. A
continuous variable can be recorded in the finest unit of measurement, finest in the sense that
it enables measurement to the maximum degree of precision.
b) Discrete variables: A discrete variable is one whose outcomes are measured in fixed
numbers. Such data are essentially count data. They are derived from a process of counting,
such as the number of items possessing or not possessing a certain characteristic. The number
of customers visiting a department store every day, the number of incoming flights at an
airport, and the number of defective items in a consignment received for sale are all examples
of discrete data.
ii) Qualitative variables refer to qualitative characteristics of a subject or an object. A characteristic is
qualitative in nature when its observations are defined and noted in terms of the presence or
absence of a certain attribute in discrete numbers. These variables are further classified as
nominal and rank variables.
a) Nominal variables are the outcome of classification into two or more categories of items or
units comprising a sample or a population according to some quality characteristic.
Classification of students according to sex (as males and females), of workers according to
skill (as skilled, semi-skilled, and unskilled), and of employees according to the level of
education (as matriculates, undergraduates, and post-graduates) all result in nominal data.
Given any such basis of classification, it is always possible to assign each item to a particular
class and make a summation of items belonging to each class. The count data so obtained are
called nominal data.

b) Rank variables, on the other hand, are the result of assigning ranks to specify order in terms
of the integers 1, 2, 3, ..., n. Ranks may be assigned according to the level of performance in a
test, a contest, a competition, an interview, or a show. The candidates appearing in an
interview, for example, may be assigned ranks in integers ranging from 1 to n, depending on
their performance in the interview. Ranks so assigned can be viewed as the ordered values
of a variable involving performance as the quality characteristic.
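
The two kinds of qualitative data described above can be sketched in Python as follows. The student records, candidate names and interview scores are hypothetical and serve only to show how nominal counts and ranks are obtained.

```python
from collections import Counter

# Nominal data: assign each student to a class (sex), then count per class
students_sex = ["female", "male", "female", "female", "male", "female"]
nominal_counts = Counter(students_sex)

# Rank data: order interview candidates by score, best first (rank 1)
interview_scores = {"Abebe": 72, "Sara": 88, "Kebede": 65, "Hana": 80}
ordered = sorted(interview_scores, key=interview_scores.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

print(dict(nominal_counts))
print(ranks)
```

Note that the ranks record only the order of the candidates; the gap between rank 1 and rank 2 says nothing about how far apart the underlying scores were.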

1.6. Types and source of data

Depending on their source, data can be classified into two types, secondary and primary:
(i) Secondary data: They already exist in some form, published or unpublished, in an identifiable
secondary source. They are, generally, available from published source(s), though not
necessarily in the form actually required.
(ii) Primary data: Those data which do not already exist in any form, and thus have to be
collected for the first time from the primary source(s). By their very nature, these data require
fresh and first-time collection covering the whole population or a sample drawn from it.

1.7. Measurement scale and scale types

Proper knowledge about the nature and type of data to be dealt with is essential in order to
specify and apply the proper statistical method for their analysis and inferences. Measurement
scale refers to the property of value assigned to the data based on the properties of order,
distance and fixed zero.

In mathematical terms measurement is a functional mapping from the set of objects {Oi} to
the set of real numbers {M(Oi)}.

The goal of measurement systems is to structure the rule for assigning numbers to objects in
such a way that the relationship between the objects is preserved in the numbers assigned to
the objects. The different kinds of relationships preserved are called properties of the
measurement system.

Order

The property of order exists when an object that has more of the attribute than another object,
is given a bigger number by the rule system. This relationship must hold for all objects in the
"real world".

The property of ORDER exists when, for all i, j, if Oi > Oj, then M(Oi) > M(Oj).

Distance

The property of distance is concerned with the relationship of differences between objects. If
a measurement system possesses the property of distance it means that the unit of
measurement means the same thing throughout the scale of numbers. That is, an inch is an
inch, no matter where it falls, immediately ahead or a mile down the road.

More precisely, an equal difference between two numbers reflects an equal difference in the
"real world" between the objects that were assigned the numbers. In order to define the
property of distance in the mathematical notation, four objects are required: Oi, Oj, Ok, and Ol.
The difference between objects is represented by the "-" sign; Oi - Oj refers to the actual
"real world" difference between object i and object j, while M(Oi) - M(Oj) refers to
differences between numbers.

The property of DISTANCE exists if, for all i, j, k, l, whenever Oi - Oj ≥ Ok - Ol, then
M(Oi) - M(Oj) ≥ M(Ok) - M(Ol).

Fixed Zero

A measurement system possesses a rational zero (fixed zero) if an object that has none of the
attribute in question is assigned the number zero by the system of rules. The object does not
need to really exist in the "real world", as it is somewhat difficult to visualize a "man with no
height". The requirement for a rational zero is this: if an object with none of the attribute did
exist, would it be given the value zero? Defining O0 as the object with none of the attribute
in question, the definition of a rational zero becomes:

The property of FIXED ZERO exists if M(O0) = 0.

The property of fixed zero is necessary for ratios between numbers to be meaningful.
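
A small numerical check may make the three properties more concrete. The sketch below compares temperature, which has order and distance but no fixed zero, with height, which has all three. The temperature and height values are hypothetical.

```python
def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

temps_c = [10, 20, 30]
temps_f = [celsius_to_fahrenheit(t) for t in temps_c]  # 50.0, 68.0, 86.0

# ORDER: a warmer object gets a bigger number on both scales
order_preserved = sorted(temps_f) == temps_f

# DISTANCE: equal steps in Celsius stay equal steps in Fahrenheit
equal_differences = (temps_f[1] - temps_f[0]) == (temps_f[2] - temps_f[1])

# No FIXED ZERO: the ratio depends on the unit, so it carries no meaning
ratio_c = temps_c[1] / temps_c[0]  # 2.0
ratio_f = temps_f[1] / temps_f[0]  # about 1.36

# Height has a fixed zero: the ratio is the same in any unit
heights_cm = [150, 180]
heights_m = [h / 100 for h in heights_cm]
ratio_cm = heights_cm[1] / heights_cm[0]
ratio_m = heights_m[1] / heights_m[0]

print(order_preserved, equal_differences)
print(ratio_c, round(ratio_f, 2))
print(round(ratio_cm, 2), round(ratio_m, 2))
```

Saying that 20 °C is "twice as hot" as 10 °C is therefore not meaningful, while saying that one person is 1.2 times as tall as another is.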

1.8. Types of Measurement Scales

Measurement is the assignment of numbers to objects or events in a systematic fashion. Four
levels of measurement scales are commonly distinguished: nominal, ordinal, interval, and
ratio, and each possesses different properties of measurement systems.

Nominal Scales

Nominal scales are measurement systems that possess none of the three properties stated
above.

It is the level of measurement which classifies data into mutually exclusive, all-inclusive
categories in which no order or ranking can be imposed on the data.

No arithmetic or relational operations can be applied.

Examples:
Political party preference (Republican, Democrat, or Other)
Sex (Male or Female)
Marital status (married, single, widowed, divorced)
Country code
Regional differentiation of Ethiopia.

Ordinal Scales

Ordinal Scales are measurement systems that possess the property of order, but not the
property of distance. The property of fixed zero is not important if the property of distance is
not satisfied.

It is the level of measurement which classifies data into categories that can be ranked.
However, precise differences between the ranks do not exist.

Arithmetic operations are not applicable but relational operations are applicable.

Ordering is the sole property of ordinal scale.

Examples:
Letter grades (A, B, C, D, F).
Rating scales (Excellent, very good, Good, Fair, poor).
Military status.

Interval Scales

Interval scales are measurement systems that possess the properties of order and distance, but
not the property of fixed zero. It is the level of measurement which classifies data that can be
ranked and for which differences are meaningful. However, there is no meaningful zero, so
ratios are meaningless.

All arithmetic operations except division are applicable.

Relational operations are also possible.

Examples:
IQ, Temperature in °F.

Ratio Scales

Ratio scales are measurement systems that possess all three properties: order, distance, and
fixed zero. The added power of a fixed zero allows ratios of numbers to be meaningfully
interpreted; for example, the statement that the ratio of Bekele's height to Martha's height is
1.32 is meaningful, whereas no such statement is possible with interval scales.

It is the level of measurement which classifies data that can be ranked, for which differences
are meaningful, and for which there is a true zero. True ratios exist between the different
units of measure.

All arithmetic and relational operations are applicable.

Examples:

Weight
Height
Number of students
Age
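
The four scales can be summarized as a lookup from the three properties (order, distance, fixed zero) to the scale name. The following Python sketch is one possible way to encode that summary; the function names and the wording of the operation labels are illustrative choices, not standard terminology.

```python
# Which of the three properties each scale possesses, as stated in the text
SCALE_PROPERTIES = {
    "nominal":  (False, False, False),   # (order, distance, fixed zero)
    "ordinal":  (True,  False, False),
    "interval": (True,  True,  False),
    "ratio":    (True,  True,  True),
}

def classify_scale(order, distance, fixed_zero):
    """Return the scale whose properties match, or None if no scale does."""
    for name, properties in SCALE_PROPERTIES.items():
        if properties == (order, distance, fixed_zero):
            return name
    return None

def meaningful_operations(scale):
    """List the operations the text treats as meaningful for a scale."""
    order, distance, fixed_zero = SCALE_PROPERTIES[scale]
    operations = ["classify and count"]
    if order:
        operations.append("rank and compare")
    if distance:
        operations.append("add and subtract")
    if fixed_zero:
        operations.append("form ratios")
    return operations

print(classify_scale(order=True, distance=True, fixed_zero=False))
for scale in SCALE_PROPERTIES:
    print(scale, "->", meaningful_operations(scale))
```

A table like this can be a useful checklist when working through the exercise that follows: decide which properties a measurement rule preserves, then read off the scale.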

EXERCISE

The following is a list of different attributes and rules for assigning numbers to objects.
Try to classify the different measurement systems into one of the four types of scales.

1. Your checking account number as a name for your account.
2. Your checking account balance as a measure of the amount of money you have in that
account.
3. Your score on the first statistics test as a measure of your knowledge of statistics.
4. Your score on an individual intelligence test as a measure of your intelligence.
5. The distance around your forehead measured with a tape measure as a measure of your
intelligence.
6. A response to the statement "Abortion is a woman's right" where "Strongly Disagree" = 1,
"Disagree" = 2, "No Opinion" = 3, "Agree" = 4, and "Strongly Agree" = 5, as a measure of
attitude toward abortion.
7. Times for swimmers to complete a 50-meter race.
8. Months of the year (Meskerem, Tikimt, …).
9. Blood type of individuals: A, B, AB and O.
10. Pollen counts provided as numbers between 1 and 10 where 1 implies there is almost no
pollen and 10 that it is rampant, but for which the values do not represent an actual count of
grains of pollen.
11. Region numbers of Ethiopia (1, 2, 3, etc.).
12. The number of students in a college.
13. The net wages of a group of workers.
14. The height of the men in the same town.

1.9. Statistics and its importance in business decisions

There are three major functions in any business enterprise in which the statistical methods are
useful. These are as follows:
(i) The planning of operations: This may relate to either special projects or to the recurring activities
of a firm over a specified period.
(ii) The setting up of standards: This may relate to the size of employment, volume of sales,
fixation of quality norms for the manufactured product, norms for the daily output, and so
forth.
(iii) The function of control: This involves comparison of actual production achieved against the
norm or target set earlier. In case the production has fallen short of the target, it suggests
remedial measures so that such a deficiency does not occur again.

It is worth noting that although these three functions (planning of operations, setting
standards, and control) are separate, in practice they are very much interrelated.

Different authors have highlighted the importance of Statistics in business. For instance,
Croxton and Cowden give numerous uses of Statistics in business such as project planning,
budgetary planning and control, inventory planning and control, quality control, marketing,
production and personnel administration. Within these also they have specified certain areas
where Statistics is very relevant. Another author, Irving W. Burr, dealing with the place of
statistics in an industrial organization, specifies a number of areas where statistics is
extremely useful. These are: customer wants and market research, development design and
specification, purchasing, production, inspection, packaging and shipping, sales and
complaints, inventory and maintenance, costs, management control, industrial engineering
and research.

Statistical problems arising in the course of business operations are multitudinous. As such,
one may do no more than highlight some of the more important ones to emphasize the
relevance of statistics to the business world. In the sphere of production, for example,
statistics can be useful in various ways.

Statistical quality control methods are used to ensure the production of quality goods.
This is achieved by identifying and rejecting defective or substandard goods. Sales targets can
be fixed on the basis of sales forecasts, which are made using various methods of forecasting.
Analysis of sales effected against the targets set earlier would indicate the deficiency in
achievement, which may be on account of several causes:

(i) targets were too high and unrealistic,
(ii) salesmen's performance has been poor,
(iii) competition has emerged or increased,
(iv) the company's product is of poor quality, and so on.

These factors can be further investigated.

Another sphere in business where statistical methods can be used is personnel management.
Here, one is concerned with the fixation of wage rates, incentive norms and performance
appraisal of individual employee. The concept of productivity is very relevant here. On the
basis of measurement of productivity, the productivity bonus is awarded to the workers.
Comparisons of wages and productivity are undertaken in order to ensure increases in
industrial productivity.

Statistical methods could also be used to ascertain the efficacy of a certain product, say,
medicine. For example, a pharmaceutical company has developed a new medicine for the
treatment of bronchial asthma. Before launching it on a commercial basis, it wants to ascertain
the effectiveness of this medicine. It undertakes an experiment involving the formation
of two comparable groups of asthma patients. One group is given this new medicine for a
specified period and the other one is treated with the usual medicines. Records are maintained
for the two groups for the specified period. This record is then analysed to ascertain if there is
any significant difference in the recovery of the two groups. If the difference is really
significant statistically, the new medicine is commercially launched.
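
The text does not say which method is used to judge whether the difference between the two groups is significant. As one possible sketch, the Python code below applies a two-proportion z-test to hypothetical recovery counts. The counts, the choice of test and the 0.05 threshold are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical trial results (invented for illustration)
new_recovered, new_total = 60, 80        # group given the new medicine
usual_recovered, usual_total = 45, 80    # group given the usual medicines

p_new = new_recovered / new_total
p_usual = usual_recovered / usual_total

# Pooled recovery proportion under the assumption of no real difference
p_pooled = (new_recovered + usual_recovered) / (new_total + usual_total)
standard_error = sqrt(p_pooled * (1 - p_pooled) * (1 / new_total + 1 / usual_total))

# Test statistic and two-sided p-value from the standard normal distribution
z = (p_new - p_usual) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

significant = p_value < 0.05   # 0.05 is a conventional, assumed threshold
print(f"z = {z:.2f}, p-value = {p_value:.4f}, significant: {significant}")
```

A small p-value suggests that a difference this large would be unlikely if the two treatments were equally effective, which is the kind of evidence the company would want before a commercial launch.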

1.10. Scope of statistics

Apart from the methods comprising the scope of descriptive and inferential branches of
statistics, statistics also consists of methods of dealing with a few other issues of specific
nature. Since these methods are essentially descriptive in nature, they have been discussed
here as part of the descriptive statistics. These are mainly concerned with the following:

(i) It often becomes necessary to examine how two paired data sets are related. For example, we may
have data on the sales of a product and the expenditure incurred on its advertisement for a
specified number of years. Given that sales and advertisement expenditure are related to each
other, it is useful to examine the nature of relationship between the two and quantify the
degree of that relationship. As this requires use of appropriate statistical methods, these fall
under the purview of what we call regression and correlation analysis.
(ii) Situations occur quite often when we require averaging (or totalling) of data on prices and/or
quantities expressed in different units of measurement. For example, price of cloth may be
quoted per meter of length and that of wheat per kilogram of weight. Since ordinary methods
of totalling and averaging do not apply to such price/quantity data, special techniques needed
for the purpose are developed under index numbers.
(iii) Many a time, it becomes necessary to examine the past performance of an activity with a view
to determining its future behaviour. For example, when engaged in the production of a
commodity, monthly product sales are an important measure of evaluating performance. This
requires compilation and analysis of relevant sales data over time. The more complex the
activity, the more varied the data requirements. For profit maximizing and future sales
planning, forecast of likely sales growth rate is crucial. This needs careful collection and
analysis of past sales data. All such concerns are taken care of under time series analysis.
(iv) Obtaining the most likely future estimates on any aspect(s) relating to a business or economic
activity has indeed been engaging the minds of all concerned. This is particularly important
when it relates to product sales and demand, which serve the necessary basis of production
scheduling and planning. The regression, correlation, and time series analyses together help
develop the basic methodology for this purpose. Thus, the study of methods and techniques
of obtaining the likely estimates on business/economic variables comprises the scope of what
we do under business forecasting.
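
Two of the ideas in the list above, correlation and index numbers, can be sketched briefly in Python. All figures are hypothetical, and the index shown (a simple average of price relatives) is only one of several possible index number formulas.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)

# Hypothetical yearly figures (invented for illustration)
advertising = [10, 12, 15, 18, 20]
sales = [100, 115, 140, 165, 180]
r = pearson_r(advertising, sales)
print(f"Correlation between advertising and sales: {r:.3f}")

# Index numbers: prices quoted in different units become unit-free relatives
base_prices = {"cloth (per metre)": 50, "wheat (per kg)": 20}
current_prices = {"cloth (per metre)": 60, "wheat (per kg)": 25}
relatives = {item: 100 * current_prices[item] / base_prices[item]
             for item in base_prices}
price_index = sum(relatives.values()) / len(relatives)
print(relatives, price_index)
```

Because each price relative is a percentage of its own base price, the per-metre and per-kilogram prices can be averaged even though the raw prices cannot.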

Keeping in view the importance of inferential statistics, the scope of statistics may finally be
restated as consisting of statistical methods which facilitate decision-making under
conditions of uncertainty. While the term statistical methods is often used to cover the subject
of statistics as a whole, in particular it refers to methods by which statistical data are analysed,
interpreted, and the inferences drawn for decision-making.

Though generic in nature and versatile in their applications, statistical methods have come to
be widely used, especially in all matters concerning business and economics. These are also
being increasingly used in biology, medicine, agriculture, psychology, and education. The
scope of application of these methods has started opening and expanding in a number of
social science disciplines as well. Even a political scientist finds them of increasing relevance
for examining political behavior and it is, of course, no surprise to find even historians
using statistical data, for history is essentially past data.

1.11. Limitations of statistics

Statistics has a number of limitations; the pertinent ones are as follows:

(i) There are certain phenomena or concepts where statistics cannot be used. This is because these
phenomena or concepts are not amenable to measurement. For example, beauty, intelligence,
courage cannot be quantified. Statistics has no place in all such cases where quantification is
not possible.
(ii) Statistics reveal the average behaviour, the normal or the general trend. Applying the
'average' concept to an individual or a particular situation may lead to a wrong
conclusion and sometimes may be disastrous. For example, one may be misled when told
that the average depth of a river from one bank to the other is four feet, when there may be
some points in between where its depth is far more than four feet. On this understanding, one
may enter the river at those points of greater depth, which may be hazardous.
(iii) Since statistics are collected for a particular purpose, such data may not be relevant or useful
in other situations or cases. For example, secondary data (i.e., data originally collected by
someone else) may not be useful for the other person.
(iv) Statistics are not 100 per cent precise as is Mathematics or Accountancy. Those who use
statistics should be aware of this limitation.
(v) In statistical surveys, sampling is generally used as it is not physically possible to cover all the
units or elements comprising the universe. The results may not be appropriate as far as the
universe is concerned. Moreover, different surveys based on the same size of sample but
different sample units may yield different results.
(vi) At times, association or relationship between two or more variables is studied in statistics, but
such a relationship does not indicate a 'cause and effect' relationship. It simply shows the
similarity or dissimilarity in the movement of the two variables. In such cases, it is the user
who has to interpret the results carefully, pointing out the type of relationship obtained.
(vii) A major limitation of statistics is that it does not reveal everything pertaining to a certain
phenomenon. There is some background information that statistics does not cover. Similarly,
there are some other aspects related to the problem on hand, which are also not covered. The
user of Statistics has to be well informed and should interpret Statistics keeping in mind all
other aspects having relevance on the given problem.
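
The river example in limitation (ii) can be checked with a few lines of Python. The depth readings below are hypothetical and were chosen so that the average comes out at four feet.

```python
from statistics import mean

# Hypothetical depths (in feet) measured at equal steps across a river
depths = [1, 2, 4, 8, 9, 4, 2, 2]

average_depth = mean(depths)
deepest_point = max(depths)
points_above_average = [d for d in depths if d > average_depth]

print(f"Average depth: {average_depth} feet")
print(f"Deepest point: {deepest_point} feet")
print(f"Readings deeper than the average: {points_above_average}")
```

The average alone hides the two readings of eight and nine feet, which is exactly the information a person crossing the river would need.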

Apart from the limitations of statistics mentioned above, there are misuses of it. Many people,
knowingly or unknowingly, use statistical data in the wrong manner. Let us see what the main
misuses of statistics are so that the same could be avoided when one has to use statistical data.
The misuse of Statistics may take several forms some of which are explained below.

(i) Sources of data not given: At times, the source of data is not given. In the absence of the source,
the reader does not know how far the data are reliable. Further, if he wants to refer to the
original source, he is unable to do so.
(ii) Defective data: Another misuse is that sometimes one gives defective data. This may be done
knowingly in order to defend one's position or to prove a particular point. This apart, the
definition used to denote a certain phenomenon may be defective. For example, in case of
data relating to unemployed persons, the definition may include even those who are
employed, though partially. The question here is how far it is justified to include partially
employed persons amongst unemployed ones.
(iii) Unrepresentative sample: In statistics, one often has to conduct a survey, which
necessitates choosing a sample from the given population or universe. The sample may turn
out to be unrepresentative of the universe. One may choose a sample just on the basis of
convenience. He may collect the desired information from either his friends or nearby
respondents in his neighborhood even though such respondents do not constitute a
representative sample.
(iv) Inadequate sample: Earlier, we have seen that a sample that is unrepresentative of the universe
is a major misuse of statistics. This apart, at times one may conduct a survey based on an
extremely inadequate sample. For example, in a city we may find that there are 100,000
households. When we have to conduct a household survey, we may take a sample of merely
100 households comprising only 0.1 per cent of the universe. A survey based on such a small
sample may not yield right information.
(v) Unfair comparisons: An important misuse of statistics is making unfair comparisons from the data
collected. For instance, one may construct an index of production choosing a base year
in which production was much lower. Then he may compare the subsequent year's production
with this low base.

Such a comparison will undoubtedly give a rosy picture of the production though in reality it
is not so. Another source of unfair comparisons could be when one makes absolute
comparisons instead of relative ones. An absolute comparison of two figures, say, of
production or export, may show a good increase, but in relative terms it may turn out to be
very negligible. Another example of unfair comparison is when the population in two cities is
different, but a comparison of overall death rates and deaths by a particular disease is
attempted. Such a comparison is wrong. Likewise, when data are not properly classified or
when changes in the composition of population in the two years are not taken into
consideration, comparisons of such data would be unfair as they would lead to misleading
conclusions.

(vi) Unwarranted conclusions: Another misuse of statistics may be on account of unwarranted
conclusions. This may be as a result of making false assumptions. For example, while making
projections of population in the next five years, one may assume a lower rate of growth
though the past two years indicate otherwise. Sometimes one may not be sure about the
changes in business environment in the near future. In such a case, one may use an
assumption that may turn out to be wrong. Another source of unwarranted conclusions may be
the use of the wrong average. Suppose a series contains extreme values, one too high and
the other too low, such as 800 and 50. The use of an arithmetic average in such a case may
give a wrong idea. Instead, the harmonic mean would be proper in such a case.
(vii) Confusion of correlation and causation: In statistics, several times one has to examine the
relationship between two variables. A close relationship between the two variables may not
establish a cause-and-effect-relationship in the sense that one variable is the cause and the
other is the effect. It should be taken as something that measures degree of association rather
than try to find out causal relationship.
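
Three of the misuses above involve simple arithmetic that can be verified directly. The Python sketch below uses the figures quoted in the text for the sample size and the extreme values; the production figures for the base-year comparison are hypothetical.

```python
from statistics import mean, harmonic_mean

def percent_change(new, old):
    """Percentage change from old to new."""
    return 100 * (new - old) / old

# Inadequate sample: 100 households out of 100,000
sampling_percent = 100 * 100 / 100_000

# Unfair comparison: same current output, two different base years
current_output = 1100
growth_from_low_base = percent_change(current_output, 400)      # unusually poor year
growth_from_normal_base = percent_change(current_output, 1000)  # typical year

# Wrong average: extreme values of 800 and 50
arithmetic = mean([800, 50])
harmonic = harmonic_mean([800, 50])

print(f"Sample covers {sampling_percent}% of the universe")
print(f"Growth from low base: {growth_from_low_base}%")
print(f"Growth from normal base: {growth_from_normal_base}%")
print(f"Arithmetic mean: {arithmetic}, harmonic mean: {harmonic:.2f}")
```

The same current output looks like dramatic growth or modest growth depending only on the base year chosen, and the two averages of the same pair of values differ by a wide margin.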

1.5 Summary

In summary, ‘Statistics’ means numerical information expressed in quantitative
terms. As a matter of fact, data have no limits as to their reference, coverage, and scope. At
the macro level, these are data on gross national product and shares of agriculture,
manufacturing, and services in GDP (Gross Domestic Product). At the micro level, individual
firms, howsoever small or large, produce extensive statistics on their operations. The annual
reports of companies contain a variety of data on sales, production, expenditure, inventories,
capital employed, and other activities. These data are often field data, collected by employing
scientific survey techniques. Unless regularly updated, such data are the product of a one-time
effort and have limited use beyond the situation that may have called for their collection. A
student knows statistics more intimately as a subject of study like economics, mathematics,
chemistry, physics, and others. It is a discipline, which scientifically deals with data, and is
often described as the science of data. In dealing with statistics as data, statistics has
developed appropriate methods of collecting, presenting, summarizing, and analyzing data,
and thus consists of a body of these methods.

1.8 Self-test questions

1. Define Statistics. Explain its types, and importance to trade, commerce and business.

2. “Statistics is all-pervading”. Elucidate this statement.

3. Write a note on the scope and limitations of Statistics.

4. Distinguish between descriptive Statistics and inferential Statistics.

CHAPTER TWO: VISUAL DESCRIPTION OF DATA

Introduction
The first step in any statistical investigation is collecting relevant data. After the data have been
collected, they have to be organized and presented in a systematic manner. Common methods of
presenting numerical data are frequency distributions and diagrammatic and graphical methods.
Frequency distributions are of different types: ungrouped (discrete), qualitative (categorical),
grouped (continuous), relative, and cumulative frequency distributions.
Based on the frequency distributions mentioned above, we can present our data using different
diagrams, graphs, stem-and-leaf plots, and dot plots. The diagrams to be discussed are bar charts
(simple, component, percentage component, and multiple) and the pie chart. The graphs used for
presenting data include the histogram, frequency polygon, cumulative frequency curves (ogives),
line graph, and vertical line graph.
Learning Outcomes
At the end of this chapter students will be able to:
• Construct relative and cumulative frequency distributions for raw data
• Identify class marks, class widths and class boundaries
• Present numerical data using graphs and charts
• Distinguish between categorical and continuous frequency distributions
• Follow the principles for constructing frequency distributions
• Distinguish between the 'less than' and 'more than' cumulative frequency distributions
• Distinguish between the different types of diagrams
• Distinguish between the different types of graphs
• Construct different graphs, diagrams, stem-and-leaf plots and dot plots for a given data set

2.1. Data Type and Methods of Data Collection

Basic Concepts

In any statistical investigation the first step is to collect a set of related observations (data)
from which conclusions may be drawn.

Data: a set of related information (facts) from which statistical conclusions may be drawn;
equivalently, data are the observed values of a variable.

Variable: a characteristic that can assume different values. Based on the information desired,
variables can be classified as qualitative or quantitative.

Qualitative Data: are data which are non-numeric in nature and cannot be measured or
described numerically.

Examples: soil type of a fruit farm, sex of a patient, eye color of an
ostrich, gender, religion, type of sport (football, athletics, ...), etc.

Quantitative Data: are data that can be expressed numerically or are data that are numeric in
nature. Quantitative data can be further classified as discrete or continuous.

a. Discrete Data: data that can assume only a finite or countable number of possible values. Discrete
data are usually obtained by counting.
Examples: number of customers, number of tourists in Jimma, number of children in a family,
etc.

b. Continuous Data: data that can theoretically assume any value within an interval, and hence
infinitely many possible values. Continuous data are obtained by measuring.
Examples: amount of money in a certain account, yield of wheat from a certain farm, area of
crop land in m², etc.

The nature of the data we obtain depends on the nature of the study and on the characteristics of
interest in the population. For this reason, data are classified on different bases.

A. Classification by Source

Statistical data may be classified into two categories, depending upon the source: primary data
and secondary data.

Primary Data: are those data which are collected by the investigator himself for the purpose
of a specific inquiry or study. Such data are original in character and are mostly generated by
surveys conducted by individuals or research institutions. Primary data give first-hand
information and are generally more reliable and accurate.

The primary method of data collection consists of obtaining data or information by any of the
following methods:

• Observation: a technique that involves systematically selecting, watching and recording the
behaviors of people or other phenomena, and aspects of the setting in which they occur, for the
purpose of gaining specified information.
• Direct personal interview: a conversation between two people that is initiated by the
interviewer (researcher) in order to obtain the required information. The interviewer (usually
the investigator) prepares a series of questions directly related to the study in advance and conducts
the interview. Tape recorders and other necessary materials may be taken along by the interviewer.
• Mailed questionnaires: questionnaires are sent by post to the informants together with a
polite covering letter, and the informants return them with their answers to the researcher.
• Self-administered questionnaires: a method of data collection in which researchers give a
well-organized questionnaire directly to the respondents.

Secondary Data: When an investigator uses data which have already been collected by
others, such data are called "Secondary Data". Data are primary for the agency that
collected them and become secondary for someone else who uses them for his own
purposes. Secondary data can be obtained from journals, reports, government and
non-government publications, and publications of professionals and research organizations (in general,
published and unpublished sources).

Secondary data are less expensive to collect, in both money and time. In most cases,
however, secondary data must be used with utmost care because:

• They may be full of errors.
• The primary objective of collecting them might be different from the purpose of the user of
the data.
• There may have been bias introduced while collecting them.
• The size of the sample may have been inadequate.

B. Classification by the Role of Time

According to the role of time, data are classified into cross-section and time series data.

Cross-section data are a set of observations taken at one point in time.

Time series data are a set of observations collected over a sequence of times, usually at equal
intervals, which may be weekly, monthly, quarterly, yearly, etc.

2.2. Methods of data collection

Source of Data

Essentially, we have two categories of data, namely primary data and secondary data.

Primary data: are data which are collected directly from the units or individual respondents
for the purpose of a certain study.

• These data are original in character and are collected to meet the specific needs of the problem at hand.
• They are collected for the first time by the immediate users of the data.

Example:

• Data obtained from censuses
• Data collected directly from individuals

Secondary data: are data which are taken from the records of institutions that collect and
publish statistics as part of their routine duties. These are already existing data which have
previously been collected and reported by some individual or organization for their own purposes
and which at a later stage are made available to other individuals or
organizations.

Example:

• Data that could be taken from census reports
• Data that could be taken from books, newspapers, magazines and the like

Methods of data collection

Data collection is the first task to be carried out in statistical analysis. There are two methods
of data collection, namely primary and secondary methods of data collection.

The primary method consists of obtaining data or information by any of the following ways:

a) Direct personal observation

It is a technique that involves systematically selecting, watching, and recording the behaviors of
people or other phenomena, and aspects of the setting in which they occur, for the purpose of
getting specified information. It includes all methods from simple visual observation to the
use of high-level machines and measurements and sophisticated equipment or facilities, such as
radiographic and X-ray machines, microscopes, and biochemical, clinical, and
microbiological examinations. It is most commonly used in studies relating to behavioral
science.

Advantage: Gives relatively more accurate data on behavior and activities. The method is
independent of respondent’s willingness to respond.

Disadvantages: The data may be affected by the investigator's or observer's own bias, prejudice,
desires, etc., and the method needs more resources and skilled human power when high-level
machines are used. The information provided by this method is very limited because unforeseen
factors may interfere with the observations.

b) Personal interview

This involves the presentation of oral (verbal) stimuli and replies in the form of oral (verbal) responses.

These are some commonly used data collection techniques. Therefore, designing a good tool
which will serve for collecting information is a vital task that requires due attention while
developing research proposals.

Studies with many respondents often use shorter, highly structured questionnaires, whereas
smaller studies allow more flexibility and may use questionnaires with a number of open-
ended questions.

Once the decision to use interviews has been made, the interviews may be more or less structured.
An unstructured interview is flexible: the content, wording and order of the questions vary from
interview to interview.

Standardized methods (where the wording and order of the questions are decided in advance)
of asking questions are usually preferred in community research, since they provide more
assurance that the data will be reproducible.

Less structured interviews may be useful in:

• preliminary surveys, and
• intensive studies of perceptions, attitudes, motivation and affective reactions.

c) Self-administered questionnaire

Self-administered questionnaires are questionnaires that the respondents complete by themselves.
Their main characteristics are:

• They are simple in nature and cheap.
• They can be administered to many persons simultaneously.
• They can be administered by mail (unlike interviews), but this demands a certain level of
education and skill on the part of the respondents.
• People of low socio-economic status are less likely to respond to a mailed questionnaire.

d) Information from correspondents

There is also another method of data collection in which selected correspondents take part. In
this method, the correspondents collect the data according to
the guidelines given by the institution conducting the survey.

e) Mailed Questionnaire Method

Under this method, the investigator prepares a questionnaire containing a number of questions
pertaining to the field of inquiry. The questionnaire is sent by mail to the informants together
with a covering letter that requests the respondents to cooperate by giving correct responses
and returning the filled-in questionnaire, and that explains such details as the objectives of
the data collection, how the questionnaire should be filled in, how
important the response of every selected respondent is, and so on.

Types of questionnaires

Open-ended question: permits a free response that should be recorded in the
respondent's own words. This type of question is useful for obtaining information on
sensitive issues, opinions and facts not very familiar to the researcher.

Closed-ended question: offers a list of options from which the respondent chooses. In its
simplest form it has only two possible answers (yes/no or true/false).

Multiple-choice question: the respondent is restricted to the choices given and selects one of the
alternative possible answers.

In designing a questionnaire:

• The questions must be simple.
• The questions should follow a logical sequence.
• It should be designed in such a way that the answers can be cross-checked.
• The questions should not be ambiguous.
• Leading questions should be avoided.
• Sensitive topics should be avoided or placed at the end.

Merits

• It is free from the bias of the interviewer.
• Respondents who are not easily approachable can also be reached conveniently.

Demerits

• The rate of return of duly filled-in questionnaires is low.
• It can be used only with literate respondents.
• Control over the questionnaire may be lost.

The data obtained through these methods are called primary data, as explained above.

The other method of data collection, secondary method, is a method by which we collect data
from secondary sources such as administrative records, books, survey results and so on. Data
collected through such methods are known as secondary data, as has been discussed above.

2.3. Method of Data presentation

The Frequency Distribution and the Histogram

In the previous chapter you were introduced to the definition, applications, uses and
limitations of statistics, as well as to methods of data collection and sampling from a population.
Once the data have been collected, we have a mass of raw data. Our
minds cannot readily grasp the overall content of such a mass of data. Hence, we have to
condense or summarize them in the form of tables and graphs. This chapter deals with the
presentation of numerical data using tables, diagrams and graphs.

Frequency distribution

Sorting of data into categories or classes will lead to formation of frequency distributions.
Frequency distribution gives the number of times a category or class occurs. The term
frequency is used to denote the number of times a category or class occurs. The frequency is
denoted by f and the class is denoted by X.

Statistical tables

Data can also be provided in statistical tables where the major advantages of such tables are:

i. Tabulated data can be more easily understood than facts given in the form of description
ii. They facilitate quick comparisons
iii. Statistical tables make the summation of items and detection of errors and omissions easier
iv. When data are tabulated, all unnecessary details and repetitions are avoided.

Definition:

Raw data: recorded information in its original collected form, whether it is counts or
measurements, is referred to as raw data.

Frequency: is the number of values in a specific class of the distribution.

Frequency distribution: is the organization of raw data in table form using classes and
frequencies.

Types of frequency distributions

There are three basic types of frequency distributions:

• Categorical frequency distribution
• Ungrouped frequency distribution
• Grouped frequency distribution

There are specific procedures for constructing each type.

1. Ungrouped frequency distribution (UFD)

It shows a distribution in which the values of a variable are linked with their respective
frequencies; it is mostly used for small data sets. A discrete frequency distribution is one which
involves a discrete variable.

Steps in constructing a discrete frequency distribution.

1. Make sure that the variable you have is discrete.
2. Determine the possible values of the variable.
3. Prepare an array for the distribution of the variable.
4. Prepare three columns: the first for the different values of the variable, the second for tally
marks to facilitate the counting, and the third for the frequency corresponding to each value of the
variable.

Values    Tally Marks    Frequency

5. Write the possible values of the variable in ascending order in the first column.

Example 2.1: The following data represents the number of books read in the past six months
by each student in a class of 25.

6 24 14 11 33

15 15 8 14 10

8 27 15 6 20

20 9 33 15 10

6 11 20 8 6

Construct a frequency distribution for this data.

Solution: These individual observations can be arranged in ascending or descending order
of magnitude, in which case the series is called an “array”.

Array of the number of books read in the past six months by each student in a class of 25:

6 6 6 6 8 8 8 9 10 10 11 11 14

14 15 15 15 15 20 20 20 24 27 33 33

Since the variable “Number of books read” can assume only the values 0,1,2,3,4,5,6…,
(which are whole numbers) it is a discrete variable.

Therefore, its frequency distribution is a discrete frequency distribution.

Number of books read (X)    Number of students (f)
6                           4
8                           3
9                           1
10                          2
11                          2
14                          2
15                          4
20                          3
24                          1
27                          1
33                          2
Total                       25
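
The tallying described in the steps above can be sketched in Python. This is only an illustrative sketch using the raw data of Example 2.1; the variable names (books, array, frequency) are assumptions made for the illustration and are not part of the module.

```python
from collections import Counter

# Number of books read by each of the 25 students (raw data of Example 2.1).
books = [
    6, 24, 14, 11, 33,
    15, 15, 8, 14, 10,
    8, 27, 15, 6, 20,
    20, 9, 33, 15, 10,
    6, 11, 20, 8, 6,
]

# Arrange the observations in ascending order (the "array").
array = sorted(books)

# Count how often each distinct value occurs, listed in ascending order of X.
frequency = dict(sorted(Counter(books).items()))

print("X   f")
for value, count in frequency.items():
    print(f"{value:<3} {count}")
print("Total", sum(frequency.values()))
```

The sum of the frequencies should always equal the number of observations, which is a quick check on the tally.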

2. Categorical frequency Distribution

Used for data that can be placed into specific categories, such as nominal or ordinal data, e.g.
marital status.

Step 1: Make a table as shown.

Class    Tally    Frequency    Percent
(1)      (2)      (3)          (4)

Step 2: Tally the data and place the result in column (2).

Step 3: Count the tally and place the result in column (3).

Step 4: Find the percentage of values in each class by using

$$\% = \frac{f}{n} \times 100$$

where f = frequency of the class and n = total number of values.

Percentages are not normally a part of a frequency distribution, but they can be added since they
are used in certain types of diagrams, such as pie charts.

Step 5: Find the total for column (3) and (4).

Example 2.2: A social worker collected the following data on marital status for 25 persons.

(M= married, S = Single, W= widowed, D= divorced)

M S D W D

S S M M M

W D S M M

W D D S S

S W W D D

Construct a frequency distribution for the above data.

Solution: Make a table as shown below

Class    Tally      Frequency    Percent
M        ||||||     6            24
S        |||||||    7            28
D        |||||||    7            28
W        |||||      5            20
Total               25           100
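
A minimal Python sketch of Steps 2 to 5, applied to the 25 marital-status codes listed in Example 2.2, is given here. The names status, counts and percent are illustrative assumptions; the percentage is computed as (f / n) × 100.

```python
from collections import Counter

# Marital status of the 25 persons in Example 2.2
# (M = married, S = single, W = widowed, D = divorced).
status = (
    "M S D W D "
    "S S M M M "
    "W D S M M "
    "W D D S S "
    "S W W D D"
).split()

n = len(status)            # total number of values
counts = Counter(status)   # frequency of each class

# Percent of each class: (f / n) * 100.
percent = {cls: counts[cls] / n * 100 for cls in counts}

for cls in ["M", "S", "D", "W"]:
    print(f"{cls}  {counts[cls]:>2}  {percent[cls]:.0f}%")
print(f"Total {n}  {sum(percent.values()):.0f}%")
```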

3. Continuous frequency distribution (Grouped frequency distribution)

A continuous frequency distribution arises from continuous variables, i.e., from
measurements made on continuous scales such as height, weight, amount of power supply, etc.
Unlike a discrete frequency distribution, where one class is used for each value of the
variable, here the values are grouped into class intervals; otherwise the purpose of
classification, i.e. condensation of the data, would be lost. The classes are formed so that each
observation falls in one and only one class and the classes are exhaustive.

Example 2.3: The following table shows the frequency distribution of the test results of 50
students in Statistics course.

Test Results    Number of students
10 – 13         8
14 – 17         15
18 – 21         15
22 – 25         7
26 – 30         5
Total           50

The categories into which the observations are distributed are called classes or class intervals.
The classes should be set so that they contain all items and no two classes share the same
item. This is the basic principle in the construction of such frequency distributions. We will
define some concepts associated with continuous frequency distributions in the following
way.

Class limits: In the above table the students are distributed into different classes. There are 8
students with scores between 10 and 13. The numbers 10 and 13 are called the lower- and upper-
class limits, respectively. There are 15 students with scores between 14 and 17. The numbers
14 and 17 are called the lower- and upper-class limits, respectively.

Class limits are therefore the lowest and highest values that can be included in a class. In the
above example, the numbers 10, 14, 18 and 22 are lower-class limits (LCL) and
the numbers 13, 17, 21 and 25 are upper-class limits (UCL).

Class boundaries (real class limits): A class boundary is a number that does not appear in
the stated class limits but is rather a value that falls midway between the upper limit of one
class and the lower limit of the next higher class.

In practice, a class boundary is obtained by adding the upper-class limit of one class
interval to the lower-class limit of the next higher class interval and dividing the sum by 2.

Let d= LCL of the second class - UCL of the first class.

Then adding ½ d to upper limits gives the upper-class boundaries (UCB) and subtracting ½ d
from the lower limits gives the lower-class boundary (LCB)

For the data in the above table,

d= 14 – 13 =1

½ d = 0.5

Subtracting 0.5 from the lower limits gives 9.5, 13.5, 17.5 and 21.5, which are the lower-class
boundaries.

Adding 0.5 to the upper limits gives 13.5, 17.5, 21.5 and 25.5, which are the upper-class
boundaries.

Therefore, the above table becomes

Classes    Frequency    Class boundaries
10-13      8            9.5-13.5
14-17      15           13.5-17.5
18-21      20           17.5-21.5
22-25      7            21.5-25.5
Total      50

Alternatively, class boundaries are obtained by subtracting half of the unit of measurement (u)
from the lower-class limit and adding half of u to the upper-class limit of a class,

where u is the distance between two possible consecutive measurements. It is usually taken as 1,
0.1, 0.01, 0.001, ...

u = 1 if all the observations are whole numbers

u = 0.1 if all the observations are given to one decimal place

u = 0.01 if all the observations are given to two decimal places

Let LCB = lower-class boundary

UCB = upper-class boundary


Then $LCB_i = LCL_i - \frac{u}{2}$ and $UCB_i = UCL_i + \frac{u}{2}$.

For the data in the above example, consider the 2nd class, 14-17. Since u = 1,

$$LCL_2 = 14, \quad UCL_2 = 17, \quad \frac{u}{2} = \frac{1}{2} = 0.5$$

$$LCB_2 = 14 - 0.5 = 13.5, \quad UCB_2 = 17 + 0.5 = 17.5$$
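
The boundary rule can be checked with a few lines of Python. This is a small sketch, assuming whole-number data (u = 1) and the four classes 10-13, 14-17, 18-21 and 22-25 discussed above; the function name class_boundaries is a hypothetical helper introduced only for this illustration.

```python
def class_boundaries(lcl, ucl, u=1):
    """Return (LCB, UCB) for a class with limits lcl..ucl and unit of measurement u."""
    half = u / 2
    return lcl - half, ucl + half

# Class limits of the four classes discussed in the text.
limits = [(10, 13), (14, 17), (18, 21), (22, 25)]
boundaries = [class_boundaries(lo, hi, u=1) for lo, hi in limits]

for (lo, hi), (lcb, ucb) in zip(limits, boundaries):
    print(f"{lo}-{hi}: {lcb}-{ucb}")
```

Note that the upper boundary of each class coincides with the lower boundary of the next class, so the boundaries leave no gaps between classes.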

Class width (class size): The size or width of a class interval is the difference between the
upper- and lower-class boundaries and is referred to as the class width, class size or class
length,

i.e. $CW_i = UCB_i - LCB_i$

In the above table, the first class has class width 13.5 - 9.5 = 4 and the second class, 14-17,
has class width 17.5 - 13.5 = 4. In this table all classes have an equal size, which is 4.

When all the classes are of the same size, the class width can also be obtained as the difference
between any two consecutive lower limits or upper limits (for example, 14 - 10 = 4 in the above table).

Class mark (class mid-point): is a value which lies mid-way between
the lower and upper limits of the class and is obtained by adding the lower- and upper-class
limits (or boundaries) and dividing the sum by two,

i.e. $$\text{class mark} = CM = \frac{LCL + UCL}{2} = \frac{LCB + UCB}{2}$$

Note that when the class size is uniform in a distribution, after finding the class mark of the
first class the remaining are obtained by adding the class size. So, in the case of classes with
the same size, the class width can also be obtained as the difference between any two
consecutive class marks.

For the distribution in the above table, the class mark of the first class is 11.5; then the class
mark of the second class is 11.5 + 4 = 15.5, the class mark of the 3rd class is 15.5 + 4 = 19.5, and
that of the fourth class is 19.5 + 4 = 23.5. We then have the following table.

Class Limits    Class Boundaries    Class Mark    Frequency
10-13           9.5-13.5            11.5          8
14-17           13.5-17.5           15.5          15
18-21           17.5-21.5           19.5          20
22-25           21.5-25.5           23.5          7
Total                                             50
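
As a quick check of the class width and class mark formulas, the following Python sketch recomputes both for the four classes above. It assumes whole-number data (u = 1); the variable names are illustrative only.

```python
# Class limits of the table above (whole-number data, so u = 1).
limits = [(10, 13), (14, 17), (18, 21), (22, 25)]
u = 1

rows = []
for lcl, ucl in limits:
    lcb, ucb = lcl - u / 2, ucl + u / 2   # class boundaries
    width = ucb - lcb                     # CW = UCB - LCB
    mark = (lcl + ucl) / 2                # CM = (LCL + UCL) / 2
    rows.append((lcl, ucl, lcb, ucb, width, mark))

widths = [row[4] for row in rows]
marks = [row[5] for row in rows]

# With equal class sizes, consecutive class marks differ by the class width.
mark_gaps = [b - a for a, b in zip(marks, marks[1:])]

print("widths:", widths)
print("class marks:", marks)
print("gaps between class marks:", mark_gaps)
```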

Guidelines for classes

1. There should be between 5 and 20 classes.
2. The classes must be mutually exclusive. This means that no data value can fall into two
different classes.
3. The classes must be all inclusive or exhaustive. This means that all data values must be
included.
4. The classes must be continuous. There are no gaps in a frequency distribution.
5. The classes must be equal in width. The exception here is the first or last class. It is possible
to have a "below ..." or "... and above" class. This is often used with ages.

Basic principles for constructing a continuous frequency distribution

The basic principles or steps in constructing a continuous frequency distribution are:

1. Determine the number of classes that will be used to group the data.

The number of classes should be neither so large as to destroy the advantage of classification
nor so small that the chief characteristics of the data are missed. The exact number of classes
to use depends upon the number of figures to be classified, the size of the figures, the purpose that
the data have to serve and the preference of the analyst.

A small number of items to be classified justifies a small number of classes. For example, if we
classify 30 items into 20 classes, we would lose more than we gain from the classification. If,
on the other hand, we classify 15,000 items into 5 classes, we would probably give away too
much information.

So, in general the approximate number of classes depends upon the number of measurements
and the following rough information gives us a good hint.

Sturges’ Rule

To fix the number of classes (k), one can use personal judgment
depending upon the nature of the investigation, or decide with the help of Sturges’ Rule, which
states that

$$k = 1 + 3.322 \log_{10}(n)$$

where k = number of classes, n = total number of observations, and log is the common logarithm (base 10).

Generally, the number of classes should be between 5 and 20. That is, not less than 5 and not
greater than 20 classes should be used for any kind of distribution.
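
Sturges' Rule is easy to evaluate numerically. The sketch below uses the constant 3.322 that appears in the worked example later in this section and rounds the result up to a whole number of classes; the rounding direction and the function name sturges_classes are assumptions made for this illustration.

```python
import math

def sturges_classes(n):
    """Approximate number of classes k = 1 + 3.322 * log10(n), rounded up."""
    k = 1 + 3.322 * math.log10(n)
    return math.ceil(k)

# Suggested number of classes for a few sample sizes.
for n in (30, 50, 100, 15000):
    print(n, sturges_classes(n))
```

For the sample sizes mentioned in the text, the rule suggests far fewer than 20 classes for 30 items and more than 5 classes for 15,000 items, which agrees with the discussion above.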

2. The size of the class has to be determined.

Whenever possible, all classes should be of the same size. This facilitates the analysis of the
data and simplifies comparison between different classes.

A frequency distribution with equal class size can be presented pictorially with greater ease.

However, in some cases equal size is either impossible or undesirable.

If the number of classes is known and if it is decided to use classes of equal size, the
determination of the size is relatively simple. Since the class size depends upon the number of
classes and the extent to which the values of the variable are spread or dispersed, the
following simple formula can be used.

$$\text{Class width} = \frac{\text{Range}}{\text{Number of classes}} \quad \text{or} \quad cw = \frac{R}{k}$$

where R = Range = Highest value - Smallest value.

3. Determine the lower-class limit of the first class so that the smallest item falls in this class. The
remaining lower-class limits are obtained using the following relations.

LCL2 = LCL1 + cw, LCL3 = LCL2 + cw, LCL4 = LCL3 + cw, …, LCLi+1 = LCLi + cw

4. Determine the upper-class limit of the first class using the formula

UCL1 = LCL1 + cw - u. The remaining upper-class limits are obtained using the following
relations.

UCL2 = UCL1 + cw, UCL3 = UCL2 + cw, …, UCLi+1 = UCLi + cw

5. Complete the continuous frequency distribution with the respective class frequencies.

Example 2.4: The following are the marks (out of 100) obtained by 50 students in Statistics.

41 50 69 77 88 92 40 51 67 75 87 94 93 86 72 62 53 49 57 67

70 85 97 95 83 79 68 52 44 44 55 64 75 83 74 60 56 42 56 69

70 42 64 52 63 60 59 61 65 78

a. Construct a continuous frequency distribution with a suitable number of classes.
b. Complete the distribution obtained in (a) with the class boundaries and class marks.

Solution: Step 1. Here n = 50, so k = 1 + 3.322 × log 50 = 1 + 3.322 × 1.69897000

Thus k = 1 + 5.643978354 = 6.643978354 ≈ 7

Step 2. $cw = \frac{R}{k} = \frac{97 - 40}{7} = 8.142857143$

Rounding 8.142857143 up to the next whole number, so that the 7 classes cover all the data and to
facilitate the construction of the distribution and the further analysis of the data,
the class size will be 9.

Step 3. Let LCL1 = 40; then LCL2 = 40 + 9 = 49, LCL3 = 49 + 9 = 58,

LCL4 = 58 + 9 = 67, LCL5 = 67 + 9 = 76, LCL6 = 76 + 9 = 85,

LCL7 = 85 + 9 = 94.

Step 4. UCL1 = 40 + 9 - 1 = 48, where u = 1.

Then UCL2 = 48 + 9 = 57, UCL3 = 57 + 9 = 66, UCL4 = 66 + 9 = 75,

UCL5 = 75 + 9 = 84, UCL6 = 84 + 9 = 93, UCL7 = 93 + 9 = 102

Step 5. Completing the distribution gives the following table.

Class limits (Marks)    Number of students (frequency)    Class boundaries    Class mark
40 – 48                 6                                 39.5 – 48.5         44
49 – 57                 10                                48.5 – 57.5         53
58 – 66                 9                                 57.5 – 66.5         62
67 – 75                 11                                66.5 – 75.5         71
76 – 84                 5                                 75.5 – 84.5         80
85 – 93                 6                                 84.5 – 93.5         89
94 – 102                3                                 93.5 – 102.5        98
Total                   50
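
The five steps can be put together in a short Python sketch that rebuilds the distribution from the raw marks of Example 2.4. It assumes whole-number data (u = 1), starts the first class at the smallest observation, and rounds both k and the class width up; the function name grouped_distribution and the dictionary keys are hypothetical names used only for this illustration.

```python
import math

# Marks (out of 100) of the 50 students in Example 2.4.
marks = [
    41, 50, 69, 77, 88, 92, 40, 51, 67, 75, 87, 94, 93, 86, 72, 62, 53, 49, 57, 67,
    70, 85, 97, 95, 83, 79, 68, 52, 44, 44, 55, 64, 75, 83, 74, 60, 56, 42, 56, 69,
    70, 42, 64, 52, 63, 60, 59, 61, 65, 78,
]

def grouped_distribution(data, u=1):
    """Build a grouped frequency distribution for whole-number data (u = 1)."""
    n = len(data)
    k = math.ceil(1 + 3.322 * math.log10(n))      # Step 1: number of classes
    cw = math.ceil((max(data) - min(data)) / k)   # Step 2: class width, rounded up
    table = []
    lcl = min(data)                               # Step 3: first lower-class limit
    for _ in range(k):
        ucl = lcl + cw - u                        # Step 4: upper-class limit
        freq = sum(lcl <= x <= ucl for x in data) # Step 5: class frequency
        table.append({
            "limits": (lcl, ucl),
            "frequency": freq,
            "boundaries": (lcl - u / 2, ucl + u / 2),
            "mark": (lcl + ucl) / 2,
        })
        lcl += cw
    return table

table = grouped_distribution(marks)
for row in table:
    print(row["limits"], row["frequency"], row["boundaries"], row["mark"])
```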

2.4. Relative Frequency Distribution

The relative frequency of a class shows the relative concentration of items in a given class
interval to the other classes of a frequency distribution.

$$\text{Relative frequency of a class} = \frac{\text{class frequency}}{\text{total frequency}}$$

Relative frequencies are usually given as decimals or percentages.

Example 2.5: The following table shows an example of a relative frequency distribution.

Wages (X)    Number of workers (f)    Relative frequency
                                      In decimals    In %
75-80        9                        0.09           9%
80-85        12                       0.12           12%
85-90        15                       0.15           15%
90-95        11                       0.11           11%
95-100       20                       0.2            20%
100-105      20                       0.2            20%
105-110      11                       0.11           11%
110-115      2                        0.02           2%
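
The relative frequencies of Example 2.5 can be reproduced with the short Python sketch below. The list names are illustrative assumptions; each relative frequency is the class frequency divided by the total frequency.

```python
# Wage classes and frequencies from Example 2.5.
classes = ["75-80", "80-85", "85-90", "90-95", "95-100", "100-105", "105-110", "110-115"]
freq = [9, 12, 15, 11, 20, 20, 11, 2]

total = sum(freq)
relative = [f / total for f in freq]   # class frequency / total frequency

for c, f, r in zip(classes, freq, relative):
    print(f"{c:<8} {f:>3}  {r:.2f}  {r:.0%}")
```

The relative frequencies of all classes always add up to 1 (or 100%).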

Cumulative Frequency Distribution

The cumulative frequency of a value of a variable (or of a class) is the sum of all the frequencies
preceding or succeeding that value (class), including the frequency of that value (class). There
are two types of cumulative frequency distributions, namely the “less than” cumulative
frequency distribution and the “more than” cumulative frequency distribution.

a) “Less than” cumulative frequency distribution

The “less than” cumulative frequency for any value of the variable (or class) is obtained by adding
the frequencies of all preceding values (or classes) to the frequency of that value (class) against
which the total is written, provided the values (classes) are arranged in ascending order of
magnitude. For a grouped frequency distribution, it is the sum of all frequencies lying below the
upper-class boundary of each class.

Example 2.6: The table below shows the ‘less than’ cumulative frequency distribution of
marks of 70 students in a class.

Marks    Frequency    ‘Less than’ cumulative frequency
30-35    5            5
35-40    10           5+10=15
40-45    15           15+15=30
45-50    30           30+30=60
50-55    5            60+5=65
55-60    5            65+5=70

The above ‘less than’ cumulative frequency distribution can also be written as follows:

Marks           Cumulative frequency
Less than 30    0
Less than 35    5
Less than 40    15
Less than 45    30
Less than 50    60
Less than 55    65
Less than 60    70
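
A running total is all that is needed to build the ‘less than’ column. The following Python sketch, using the frequencies of Example 2.6, relies on itertools.accumulate from the standard library; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Upper-class boundaries and frequencies of the classes in Example 2.6.
upper_boundaries = [35, 40, 45, 50, 55, 60]
freq = [5, 10, 15, 30, 5, 5]

# Running total of the frequencies, from the lowest class upwards.
less_than = list(accumulate(freq))

for ub, cf in zip(upper_boundaries, less_than):
    print(f"Less than {ub}: {cf}")
```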

b) ‘More than’ cumulative frequency distribution

The ‘more than’ cumulative frequency is obtained similarly by finding the cumulative totals
of frequencies starting from the highest value of the variable (class) to the lowest value
(class). Thus, in the above illustration the number of students with marks ‘more than 50’ is
5+5=10, and ‘more than 40’ is 15+30+5+5=55, and so on. The complete ‘more than’ type
cumulative frequency distribution for this data is given below:

‘More than’ cumulative frequency distribution of marks of 70 students

Marks    Frequency    ‘More than’ cumulative frequency
30-35    5            65+5=70
35-40    10           55+10=65
40-45    15           40+15=55
45-50    30           10+30=40
50-55    5            5+5=10
55-60    5            5

The above ‘more than’ cumulative frequency distribution can also be expressed in the following form:

Marks           Number of students
More than 30    70
More than 35    65
More than 40    55
More than 45    40
More than 50    10
More than 55    5
More than 60    0
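
The ‘more than’ column can be sketched in the same way by accumulating from the highest class downwards. The code below uses the same 70 marks as the illustration above; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Lower-class boundaries and frequencies of the classes in the illustration.
lower_boundaries = [30, 35, 40, 45, 50, 55]
freq = [5, 10, 15, 30, 5, 5]

# Accumulate from the highest class downwards, then restore ascending class order.
more_than = list(accumulate(reversed(freq)))[::-1]

for lb, cf in zip(lower_boundaries, more_than):
    print(f"More than {lb}: {cf}")
```

For any boundary, the ‘less than’ and ‘more than’ cumulative frequencies add up to the total number of observations (for example, 30 students score less than 45 and 40 score more than 45, giving 70).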

Remark: In the ‘less than’ cumulative frequency distribution, the c.f. refers to the upper-class
boundary of the corresponding class, and in the ‘more than’ cumulative frequency distribution, the
c.f. refers to the lower-class boundary of the corresponding class.

Histogram

A histogram is a graphical display of the distribution of a data set. A histogram looks like a
vertical bar graph, except that the columns touch each other.

The given grouped data is plotted in the form of a series of rectangles. Class boundaries are
marked along the x-axis and the frequencies along the y- axis according to a suitable scale. If
all the classes are of the same size, the height of the rectangles can be taken to be numerically
equal to the class frequencies.

If on the other hand the size of the class intervals is not uniform, the height of the rectangles
can be adjusted by taking the “frequency density” of the corresponding classes as scale for the
vertical axis.

$$\text{Frequency density} = \frac{\text{class frequency}}{\text{class width}}$$
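
To illustrate the frequency density adjustment, the Python sketch below uses a small set of classes with unequal widths. The class boundaries and frequencies are hypothetical values made up for this illustration; they do not come from the module.

```python
# Hypothetical classes with unequal widths: ((lower boundary, upper boundary), frequency).
classes = [((0, 10), 5), ((10, 20), 12), ((20, 40), 16), ((40, 80), 8)]

# Frequency density = class frequency / class width.
density = [f / (hi - lo) for (lo, hi), f in classes]

for ((lo, hi), f), d in zip(classes, density):
    print(f"{lo}-{hi}: frequency {f}, width {hi - lo}, density {d:.2f}")
```

In this made-up example the class 20-40 has the largest frequency, but the class 10-20 has the tallest rectangle once the class widths are taken into account.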

A histogram gives us an idea about the shape of the data distribution. It can indicate to us,
graphically, where the center of the data distribution lies. It will also reveal whether the
distribution is symmetric or skewed.

Example 2.11: Construct a histogram for the following frequency distribution

Class limits    fi    Class boundary
10-19           4     9.5-19.5
20-29           5     19.5-29.5
30-39           8     29.5-39.5
40-49           6     39.5-49.5
50-59           2     49.5-59.5

Solution:

[Figure: histogram of the above frequency distribution. x-axis: class boundary; y-axis: frequency (scale 0 to 10).]
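
When no plotting tool is at hand, the shape of this histogram can be previewed as text. The sketch below is only a rough stand-in for a real histogram: it draws one horizontal bar per class, with a bar length equal to the class frequency, which is adequate here because all classes have the same width. The function name text_histogram is a hypothetical helper.

```python
# Class boundaries and frequencies of the distribution in Example 2.11.
boundaries = [(9.5, 19.5), (19.5, 29.5), (29.5, 39.5), (39.5, 49.5), (49.5, 59.5)]
freq = [4, 5, 8, 6, 2]

def text_histogram(boundaries, freq, symbol="#"):
    """Return one text line per class; bar length equals the class frequency."""
    lines = []
    for (lcb, ucb), f in zip(boundaries, freq):
        lines.append(f"{lcb:>4}-{ucb:<4} | {symbol * f} ({f})")
    return lines

for line in text_histogram(boundaries, freq):
    print(line)
```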

Exercises:

What is a frequency distribution? What benefits does it offer in the summarization and
reporting of data values?


2.5. The Stem-and-Leaf Display and the Dotplot


2.5.1. Stem and leaf plot

A stem and leaf plot is a way to represent a data set graphically by splitting each number into two parts: the leading digit or digits, shown on the left side of the display and called the “stem”, and the last digit, shown on the right and called the “leaf”. To construct one, take the list of numbers and put them in order from smallest to largest. Draw a vertical line. Take all but the last digit of each number and list these from top to bottom on the left of the line, using each only once. These are the stems. Now write the last digit of each number on the right side of the line, in order, aligned with its stem. These are the leaves.

Generally:

• Put the numbers in order.

• Break off the last digit.

• The left part is the stem; the right part is the leaf.

• Graph them!

Example: An insurance company researcher conducted a survey on the number of car thefts
in a large city for a period of 30 days last summer. The raw data are shown. Construct a stem
and leaf plot by using classes 50–54, 55–59, 60–64, 65–69, 70–74, and 75–79.

52 62 51 50 69
58 77 66 53 57
75 56 55 67 73
79 59 68 65 72
57 51 63 69 75
65 53 78 66 55
Solution:

Step 1 Arrange the data in order.


50, 51, 51, 52, 53, 53, 55, 55, 56, 57, 57, 58, 59, 62, 63,

65, 65, 66, 66, 67, 68, 69, 69, 72, 73, 75, 75, 77, 78, 79

Step 2 Separate the data according to the classes.

(50, 51, 51, 52, 53, 53) (55, 55, 56, 57, 57, 58, 59) (62, 63) (65, 65, 66, 66, 67, 68, 69, 69) (72, 73) (75, 75, 77, 78, 79)

Step 3 Plot the data as shown here.

Leading digit (stem) Trailing digit (leaf)

5| 011233

5| 5567789

6| 23

6| 55667899

7| 23

7| 55789
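
The three steps above can be sketched in Python as follows. This is only an illustration: the function name `split_stem_and_leaf` is made up here, and it assumes two-digit values and a class width that divides 10, so that each stem is split into rows such as 50–54 and 55–59.

```python
from collections import defaultdict

thefts = [52, 62, 51, 50, 69, 58, 77, 66, 53, 57,
          75, 56, 55, 67, 73, 79, 59, 68, 65, 72,
          57, 51, 63, 69, 75, 65, 53, 78, 66, 55]

def split_stem_and_leaf(values, class_width=5):
    """Return (stem, leaves) rows, one row per class of the given width."""
    rows = defaultdict(list)
    for value in sorted(values):                      # Step 1: order the data
        class_start = (value // class_width) * class_width
        rows[class_start].append(value % 10)          # Step 2: group leaves by class
    # Step 3: one display row per class; the stem is the tens digit.
    return [(start // 10, "".join(map(str, leaves)))
            for start, leaves in sorted(rows.items())]

display = split_stem_and_leaf(thefts)
for stem, leaves in display:
    print(f"{stem}| {leaves}")
```

Classes that contain no observations are simply omitted by this sketch; a fuller version might print the empty stem so that gaps in the data remain visible.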

Note: When the data values are in the hundreds, such as 325, the stem is 32 and the leaf is 5.
For example, the stem and leaf plot for the data values 325, 327, 330, 332, 335, 341, 345, and
347 looks like this.

Stem | Leaf

32 | 5 7

33 | 0 2 5

34 | 1 5 7


Example 2: Consider the following data on the number of hamburgers sold by a fast-food restaurant in each of 15 weeks.
1565 1852 1644 1766 1888 1912 2044 1812
1790 1679 2008 1852 1967 1954 1733
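A stem-and-leaf display of these data follows. With a leaf unit of 10, the units digit of each value is dropped, the tens digit becomes the leaf, and the remaining leading digits form the stem (for example, 1565 is shown as stem 15, leaf 6).

Leaf unit = 10

15 | 6
16 | 4 7
17 | 3 6 9
18 | 1 5 5 8
19 | 1 5 6
20 | 0 4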
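
The effect of the leaf unit can be checked with a short Python sketch. The helper `stem_and_leaf` and its `leaf_unit` parameter are illustrative names, and the sketch assumes that digits below the leaf unit are truncated rather than rounded, which is consistent with the display above.

```python
from collections import defaultdict

sales = [1565, 1852, 1644, 1766, 1888, 1912, 2044, 1812,
         1790, 1679, 2008, 1852, 1967, 1954, 1733]

def stem_and_leaf(values, leaf_unit=1):
    """Map each stem to its ordered leaves; digits below the leaf unit are dropped."""
    rows = defaultdict(list)
    for value in sorted(values):
        scaled = value // leaf_unit        # e.g. 1565 -> 156 when leaf_unit is 10
        rows[scaled // 10].append(scaled % 10)
    return dict(rows)

hamburger_display = stem_and_leaf(sales, leaf_unit=10)
for stem, leaves in hamburger_display.items():
    print(stem, "|", " ".join(map(str, leaves)))
```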

Dot Plot

One of the simplest graphical summaries of data is a dot plot. A horizontal axis shows the
range for the data. Each data value is represented by a dot placed above the axis. The dot plot
displays each data value as a dot and allows us to readily see the shape of the distribution as
well as the high and low values.
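
A rough text version of a dot plot can be produced with a few lines of Python. The data set below is hypothetical and the function `dot_plot` is only a sketch for whole-number data; it stacks one `*` per observation above each value on the axis.

```python
from collections import Counter

# Hypothetical data set of small whole numbers, used only for illustration.
data = [2, 3, 3, 4, 4, 4, 5, 5, 7]

def dot_plot(values):
    """Return the text lines of a dot plot: one column per integer value."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    lines = []
    # Build the rows from the tallest stack of dots down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(("*" if counts[x] >= level else " ").center(3) for x in axis)
        lines.append(row.rstrip())
    lines.append("".join(str(x).center(3) for x in axis))
    return lines

plot_lines = dot_plot(data)
print("\n".join(plot_lines))
```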


Exercise: Draw the dot plot of the above example!

2.6. Other Methods for Visual Representation of the Data

Dear students, can you mention some diagrammatic and graphical methods of data presentation? Do we use the same diagram or graph for every purpose, or do we have different diagrams and graphs for different purposes? If you know them, compare your answers with what is discussed in this section.

2.6.1. Diagrammatic presentation of data

When the basis of classification is not quantitative, i.e., when the data are attributes, statistical data can be presented diagrammatically using charts. The chart could be a bar chart, a pie-chart or a pictogram, each of which has specific uses depending on the nature of the information to be depicted.

Bar charts

Bar charts are drawn in almost the same way as graphs. Data are represented by a series of bars, the height of each bar showing the size of the observation represented.

While drawing bar charts:

I. The width of the bars should be kept uniform, and

II. The gaps between successive bars should remain the same.

There are four main types of bar charts, serving different purposes: simple bar charts, component bar charts, percentage component bar charts and multiple bar charts.

Simple bar chart

In a simple bar chart, each bar represents one and only one figure. A simple bar chart is usually constructed to represent totals only.

Example 2.7: The following table shows the number of students attending four departments.

Department            Mathematics    Statistics    Physics    Chemistry
Number of students    56             45            40         50

Construct a simple bar chart for the above table.

Solution:

[Figure: simple bar chart of the number of students by department (Math. 56, Stat. 45, Physics 40, Chemistry 50). Horizontal axis: department; vertical axis: number of students.]

Component (sub-divided) bar chart

A component bar chart shows the breakdown of an aggregate into its constituent parts for a given year, place or sector. This type of chart makes it possible to compare changes in the parts as well as in the aggregate.

Example 2.8: The table and chart below show the revenue expenditure of a country on education.

Education Expenditure (in million)

                    1978-80    1980-81    1981-82
Primary             60         80         40
Secondary           40         60         60
Higher Education    20         40         20
Total               120        180        120

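[Figure: component bar chart of education expenditure for 1978-80, 1980-81 and 1981-82, each bar subdivided into Primary, Secondary and Higher Education; vertical axis from 0 to 200.]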
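
The bar heights for a component bar chart, and the shares needed for a percentage component bar chart, can be worked out from the table as in the following sketch. The dictionary layout and the function name `percentage_components` are choices made for this illustration only.

```python
# Education expenditure (in million) by level, taken from the table above.
expenditure = {
    "1978-80": {"Primary": 60, "Secondary": 40, "Higher Education": 20},
    "1980-81": {"Primary": 80, "Secondary": 60, "Higher Education": 40},
    "1981-82": {"Primary": 40, "Secondary": 60, "Higher Education": 20},
}

def percentage_components(parts):
    """Express each component as a percentage of the bar's total."""
    total = sum(parts.values())
    return {name: 100 * amount / total for name, amount in parts.items()}

# Total height of each component bar.
totals = {year: sum(parts.values()) for year, parts in expenditure.items()}
# Segment heights when every bar is rescaled to 100 percent.
shares = {year: percentage_components(parts) for year, parts in expenditure.items()}

for year in expenditure:
    print(year, totals[year], {k: round(v, 1) for k, v in shares[year].items()})
```

In a percentage component bar chart every bar has the same height (100 percent), so only the relative sizes of the parts are compared.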

Multiple bar charts

Here the interrelated component parts are shown as adjoining bars, coloured or marked differently. This allows comparison between the different parts.

Example 2.9: The chart below shows the revenue expenditure of a country on education.

[Figure: multiple bar chart of education expenditure for 1978-80, 1980-81 and 1981-82, with adjoining bars for Primary, Secondary and Higher Education; vertical axis from 0 to 90.]

Pie-chart (angular chart)

A pie-chart is a circle divided by radial lines into sections or slices so that the area of each section is proportional to the size of the amount represented. It is a simple descriptive display of data that sum to a given total. A pie-chart is probably the most illustrative way of displaying quantities as percentages of a given total. The total area of the pie represents 100 percent of the quantity of interest (the sum of the variable values in all categories), and each slice denotes the share of one category. Thus, a pie-chart indicates relative frequencies by slicing up a circle into distinct sectors.

Since the angles at a point sum to 360°, the component parts of the data are expressed as proportions of 360° and the sectors of the circle represent these parts. The angle corresponding to each component is obtained by dividing the amount for that item by the total and multiplying by 360°; the sectors are then drawn by means of a protractor.

In order to draw a pie-chart, it is convenient to form beforehand a table of percentages and the corresponding angles to be drawn at the center of the circle.

Example 2.10: The following table shows the monthly expenses of a family with an income of 1000 Birr.

Item                Food    Clothing    Rent    Others    Total
Amount (in Birr)    400     200         250     150       1000

Solution:

Item        Amount (in Birr)    Degrees (size of central angle)    Amount (in percentages)
Food        400                 144                                40
Clothing    200                 72                                 20
Rent        250                 90                                 25
Others      150                 54                                 15
Total       1000                360                                100
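
The angles and percentages in this table follow directly from the rule "amount divided by total, multiplied by 360° (or by 100)". A minimal Python sketch, with `pie_sectors` as an illustrative function name:

```python
# Monthly expenses (in Birr) from Example 2.10.
expenses = {"Food": 400, "Clothing": 200, "Rent": 250, "Others": 150}

def pie_sectors(amounts):
    """Return {item: (central angle in degrees, percentage of the total)}."""
    total = sum(amounts.values())
    return {item: (amount * 360 / total, amount * 100 / total)
            for item, amount in amounts.items()}

sectors = pie_sectors(expenses)
for item, (angle, percent) in sectors.items():
    print(f"{item}: {angle:.0f} degrees, {percent:.0f}%")
```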

[Figure: pie-chart for the above table, with sectors Food 40%, Clothing 20%, Rent 25% and Others 15%.]

2.6.2. Graphical presentation

After organizing data into frequency distributions, it is often helpful to present them graphically. Graphs communicate the essential characteristics of a frequency distribution in pictorial form so that one can readily identify these characteristics and compare one frequency distribution with another.

Histograms, frequency polygons and cumulative frequency curves are common ways of representing a frequency distribution graphically.

Frequency polygon (line graph)

A frequency polygon is a line chart of a frequency distribution in which either the values of a discrete variable or the class marks of the classes are plotted against the frequencies, and the plotted points are joined by straight lines.

It is thus a graphic presentation tool that may be used as an alternative to the histogram. For a large number of classes, a frequency polygon is preferable.

For a frequency distribution with equal class intervals, the area under the frequency polygon is equal to the area of the histogram.

Example 2.12: Construct a frequency polygon for the frequency distribution in Example 2.11.

Solution:

Class limits    Frequency    Class marks
10-19           4            14.5
20-29           5            24.5
30-39           8            34.5
40-49           6            44.5
50-59           2            54.5

[Figure: frequency polygon of the distribution above. Horizontal axis: class mark (4.5 to 64.5); vertical axis: frequency (0 to 10).]

Remark: We close the polygon by extending it to imaginary class marks, with zero frequency, to the left and to the right of the extreme class marks.
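
The points to be joined in the frequency polygon, including the two imaginary class marks, can be listed with a short Python sketch. It assumes equal class widths, as in this example; the variable names are illustrative.

```python
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]
frequencies = [4, 5, 8, 6, 2]

# Class mark = midpoint of the lower and upper class limits.
class_marks = [(lower + upper) / 2 for lower, upper in class_limits]

# Distance between successive class marks (equal class widths assumed).
width = class_marks[1] - class_marks[0]

# Close the polygon with zero-frequency points one class width beyond each end.
polygon_points = ([(class_marks[0] - width, 0)]
                  + list(zip(class_marks, frequencies))
                  + [(class_marks[-1] + width, 0)])

for mark, frequency in polygon_points:
    print(mark, frequency)
```

The first and last points fall at 4.5 and 64.5, which are the imaginary class marks on the horizontal axis of the figure.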

Cumulative frequency curve (ogive)

An ogive is the graphic representation of a cumulative frequency distribution. It can be drawn on either a ‘less than’ or a ‘more than’ basis.

a) ‘Less than’ ogive: Upper class boundaries are plotted against the ‘less than’ cumulative frequencies.
b) ‘More than’ ogive: Lower class boundaries are plotted against the ‘more than’ cumulative frequencies.

Example 2.13: Construct (a) the ‘less than’ ogive and (b) the ‘more than’ ogive for the above frequency distribution.

Solution:

Class limits    Frequency    ‘Less than’ cumulative frequency    ‘More than’ cumulative frequency
10-19           4            4                                   25
20-29           5            9                                   21
30-39           8            17                                  16
40-49           6            23                                  8
50-59           2            25                                  2

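[Figure: ‘less than’ ogive. Horizontal axis: upper class boundary (9.5 to 59.5); vertical axis: cumulative frequency (0 to 25).]

[Figure: ‘more than’ ogive. Horizontal axis: lower class boundary (9.5 to 59.5); vertical axis: cumulative frequency (0 to 25).]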

The Scatter Diagram

A scatter diagram is a graphical presentation of the relationship between two quantitative variables. One variable is shown on the horizontal axis and the other variable is shown on the vertical axis.

Depending on the kind of numerical data we have, the scatter plot may take different forms.

Example: The local ice cream shop keeps track of how much ice cream it sells versus the noon temperature on each day. Here are its figures for the last 12 days:

Ice Cream Sales vs Temperature

Temperature (°C)    Ice Cream Sales
14.2                $215
16.4                $325
11.9                $185
15.2                $332
18.5                $406
22.1                $522
19.4                $412
25.1                $614
23.4                $544
18.1                $421
22.6                $445
17.2                $408

The scatter diagram of the data is:

The scatter diagram with the fitted line is given as:
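
A fitted line of this kind is commonly obtained by the method of least squares. The module does not state how its line was fitted, so the following Python sketch simply assumes an ordinary least-squares fit of sales on temperature; `least_squares_line` is an illustrative helper, not part of the module.

```python
# Noon temperature (in degrees C) and ice cream sales (in dollars) for the 12 days.
temperature = [14.2, 16.4, 11.9, 15.2, 18.5, 22.1, 19.4, 25.1, 23.4, 18.1, 22.6, 17.2]
sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

def least_squares_line(x, y):
    """Return (slope, intercept) of the ordinary least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = least_squares_line(temperature, sales)
print(f"sales = {intercept:.1f} + {slope:.1f} * temperature")
```

A positive slope corresponds to the upward pattern in the scatter diagram: warmer days tend to go with higher sales.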


2.7. Tabulation and Contingency Tables

A statistical table is an orderly and systematic presentation of numerical data in rows and columns. Rows (stubs) are horizontal and columns (captions) are vertical arrangements. Using tables to organize data involves grouping the data into mutually exclusive categories of the variables and counting the number of occurrences (frequency) in each category. For qualitative variables, these mutually exclusive categories are naturally occurring groupings.

For quantitative variables with many distinct values, such as weight or height measurements, the groups are formed by amalgamating continuous values into class intervals. Some variables, however, have frequently used standard classes. One such variable, which has wide application in demographic surveys, is age.

The simple frequency table is used when the individual observations involve only a single variable, whereas cross tabulation is used to obtain the frequency distribution of one variable within the subsets of another variable. In addition to the frequency counts, the relative frequency is used to depict the distributional pattern of the data clearly. It shows each frequency count as a percentage of the total.

In cross-tabulated frequency distributions, where there are both row and column totals, the choice of denominator depends on which variable is to be compared across the subsets of the other variable.

Although there are no hard and fast rules to follow, the following general principles should be observed in constructing tables.

1. Tables should be as simple as possible.
2. Tables should be self-explanatory.
3. If data are not original, their source should be given in a footnote.

Cross tabulation

A cross tabulation is a tabular summary of data for two variables. The classes for one variable are represented by the rows; the classes for the other variable are represented by the columns. The following cross tabulation shows household income by educational level of the head of household (Statistical Abstract of the United States: 2008).
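
To illustrate how a cross tabulation and its row percentages are built, the sketch below uses a handful of hypothetical records. These are invented for illustration and are not the Statistical Abstract figures; the category labels and the function `row_percentages` are likewise assumptions.

```python
from collections import Counter

# Hypothetical records: (education level of head of household, income band in $1000s).
records = [
    ("High school", "Under 50"), ("High school", "Under 50"),
    ("High school", "50 and over"),
    ("Bachelor's", "Under 50"), ("Bachelor's", "50 and over"),
    ("Bachelor's", "50 and over"), ("Bachelor's", "50 and over"),
]

# Cell counts of the cross tabulation and the row totals.
cell_counts = Counter(records)
row_totals = Counter(row for row, _ in records)

def row_percentages(cell_counts, row_totals):
    """Express each cell as a percentage of its row total."""
    return {(row, col): 100 * count / row_totals[row]
            for (row, col), count in cell_counts.items()}

percentages = row_percentages(cell_counts, row_totals)
for (row, col), pct in sorted(percentages.items()):
    print(f"{row:12s} {col:12s} {cell_counts[(row, col)]}  {pct:.1f}%")
```

Using row totals as the denominator compares income bands within each education level; using column totals instead would compare education levels within each income band.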

Chapter Summary

The first step in any statistical investigation is collecting relevant data. After the data have been collected, they have to be organized and presented in a systematic manner. Common methods of presenting numerical data are frequency distributions and diagrammatic and graphical methods. Frequency distributions are of different types: ungrouped or discrete, qualitative, grouped or continuous, relative, and cumulative frequency distributions.

Based on the frequency distributions mentioned above, we can present our data using different diagrams and graphs. The diagrams discussed are bar charts (simple, component, percentage component and multiple) and pie charts. The graphs used for presenting data include the histogram, frequency polygon, cumulative frequency curves (ogives), line graph and vertical line graph.


Check list

Dear students, have you made yourself familiar with the points listed below? If not, make sure that you understand them well by going back to the units in which they are discussed.

• Constructing relative and cumulative frequency distributions for raw data
• Identifying class marks, class widths and class boundaries
• Presenting numerical data using graphs and charts
• Distinguishing between categorical and continuous frequency distributions
• Principles to be followed in constructing frequency distributions
• Distinguishing between the 'less than' and 'more than' cumulative frequency distributions
• Distinguishing between the different types of diagrams
• Distinguishing between the different types of graphs
• Constructing different graphs and diagrams for a given data set

Exercises

1. For 75 employees of a large department store, the following distribution for years of service
was obtained. Construct a histogram, frequency polygon, and ogive for the data.

Class limits Frequency


1–5 21
6–10 25
11–15 15
16–20 0
21–25 8
26–30 6
A majority of the employees have worked for how many years or less?


2. The salaries (in millions of dollars) for 31 NFL teams for a specific season are given in this
frequency distribution.

Class limits Frequency


39.9–42.8 2
42.9–45.8 2
45.9–48.8 5
48.9–51.8 5
51.9–54.8 12
54.9–57.8 5
Construct a histogram, frequency polygon, and ogive for the data; and comment on the shape
of the distribution.
3. In the following stem-and-leaf display for a set of two-digit integers, the stem is the 10’s digit,
and each leaf is the 1’s digit. What is the original set of data?
2 |002278
3 | 011359
4 |1344
5 |47
4. The National Insurance Crime Bureau reported that these data represent the number of
registered vehicles per car stolen for 35 selected cities in the United States. For example, in
Miami, 1 automobile is stolen for every 38 registered vehicles in the city. Construct a stem
and leaf plot for the data and analyze the distribution. (The data have been rounded to the
nearest whole number.)
38 53 53 56 69 89 94
41 58 68 66 69 89 52
50 70 83 81 80 90 74
50 70 83 59 75 78 73
92 84 87 84 85 84 89
Source: USA TODAY.


5. The following table shows frequencies and percent frequencies by mutual fund type.

Mutual Fund Type        Frequency    Percent Frequency
Domestic Equity         16           64
International Equity    4            16
Fixed Income            5            20
Totals                  25           100

Draw a bar chart and a pie chart for these data.
6. In Jimma, 1114 adults were asked, “How would you rate the Commercial Bank in handling
the credit problems in the financial markets?” The percent frequency distribution obtained
follows:
Rating Percent Frequency
Excellent 0
Good 4
Fair 46
Bad 40

Terrible 10

Draw a dot plot, a bar chart and a pie chart for these data.


7. The following are the numbers of automobile accidents that occurred at 60 major intersections in a certain city during the Fourth of July weekend:

0 2 5 0 1 4 1 0 2 1

5 0 1 3 0 0 2 1 3 1

1 4 0 2 4 1 2 4 0 4

3 5 0 1 3 6 4 2 0 2

0 2 3 0 4 2 5 1 1 2

2 1 6 5 0 3 3 0 0 4


a) Group these data into a frequency distribution showing how often each of the values occurs, and draw a bar chart.
b) Convert the distribution obtained in (a) above into a cumulative “or more” distribution and draw its ogive.
8. In a 2-week study of the productivity of workers, the following data were obtained on the
total number of acceptable pieces which 100 workers produced:

65 36 49 84 79 56 28 43 67 36

43 78 37 40 68 72 55 62 22 82

88 50 60 56 57 46 39 57 73 65

59 48 76 74 70 51 40 75 56 45

35 62 52 63 32 80 64 53 74 34

76 60 48 55 51 54 45 44 35 51

21 35 61 45 33 61 77 60 85 68

45 53 34 67 42 69 52 68 52 47

63 65 55 61 73 50 53 59 41 54

41 74 82 58 26 35 47 50 38 70

a) Construct a grouped frequency distribution with a suitable number of classes.
b) Convert the distribution obtained in (a) into a cumulative
   i) ‘less than’ and
   ii) ‘more than’ distribution.
c) Construct a histogram, frequency polygon, and ogives.


9. The following data show the average yearly consumption of meat in kilograms for 40 families.

12.6 17.8 19.9 19.0 10.4 20.6 13.2 22.5

14.0 15.6 19.1 20.4 20.6 18.6 18.0 15.9

13.7 14.9 18.7 18.4 20.1 24.2 19.3 13.9

11.7 16.7 15.3 18.3 17.4 23.4 22.0 17.9

21.7 18.9 14.4 9.9 16.0 16.8 10.8 16.2


a. Construct a continuous frequency distribution with suitable number of classes
b. Construct the less than ogive.
c. Construct the relative frequency distribution.
10. A frequency distribution with 6 classes of equal size is constructed to present data recorded as integers. Given that the class midpoint of the 3rd class interval is 20 and the class width is 5, write the classes.

The table below shows the weight distribution of 25 students in a basketball team.

Weight in kgs Number of students

Below 50.5 3

Below 55.5 10

Below 60.5 16

Below 65.5 20

Below 70.5 22

Below 75.5 25

a. Determine the class limits and the class marks.


b. How many of the students have weight more than 65.5 kgs? Between and
70.5 kgs?


11. The following table shows the type of cars manufactured by a certain company during 1972-
1975.

Years
Cars 1972 1973 1974 1975
Toyota 400 300 380 450

Nisan 260 340 350 390

Isuzu 330 310 445 470

Construct

a. A simple bar chart for the total number of cars manufactured.


b. Multiple bar charts.
c. Component bar chart
d. Percentage component bar chart.
12. For six local offices of a large tax preparation firm, the following data describe x = service revenues (in thousands of dollars) and y = expenses (in thousands of dollars) for supplies, freight, and postage during the previous tax preparation season:

X Y
351.4 18.4
291.3 15.8
325.0 19.3
422.7 22.5
238.1 16.0
514.5 24.
a. Draw a scatter diagram representing these data.
b. Does there appear to be any relationship between the variables? If so, is the relationship
direct or inverse?
13. A recent study showed that a typical American car owner incurs the following expenses,
on an average, when he leases a car for three years.


Expenditure item Amount ($)

Lease amount 4,500

Gasoline 1,350

Insurance 1,800

Maintenance 1,350

Draw a pie chart to portray this data


CHAPTER THREE: STATISTICAL DESCRIPTION OF DATA

3.1. Introduction and Objectives

Dear students, in the previous chapter you learned how to present collected data using appropriate methods of data presentation, such as tables, charts and graphs, so that the data can be visualized and understood easily. In this chapter, you will learn how to represent a whole data set by a central (average) observation using measures of central tendency such as the mean, mode and median, and how to describe the spread of the values around that representative average using measures of variation such as the variance and standard deviation. In addition, you will learn about quantiles, such as quartiles, deciles and percentiles, which divide ordered data into equal parts.

In general, the aims of this chapter are:

• To demonstrate understanding of key statistical techniques for reducing data to a single value.
• To determine the appropriate value which represents the whole series.
• To make easy comparisons between data.
• To find out how spread out the data values are on the number line.
• To provide a working knowledge of the statistical tools, focusing on how to calculate measures of central tendency and dispersion in business.
• To explain the types of measures of central tendency and measures of variation.
• To show you how to partition your data into equal parts.

After reading this chapter, you should be able to:

• Understand and identify the use of averages in business.
• Identify the types of measures of central tendency and know how to calculate them.
• Understand and apply the concept of measures of central tendency in your daily activities.
• Explain what measures of variation are and list their types.
• Understand when to use each type of measure of variation (dispersion).
• Differentiate between measures of central tendency and measures of variation.
• Partition the data you have using the concept of quantiles.

Introduction

In the previous chapter, we discussed the techniques of classification and tabulation, which
help in summarizing the collected data and presenting them in the form of a frequency
distribution.

Now suppose the students from two or more classes sat an examination and we wish to compare the performance of the classes, or to compare the performance of the same class after some coaching over a period of time. When making such comparisons, it is not practicable to compare the full frequency distributions of marks, however compactly these may be presented. Therefore, for such statistical analysis, we need a single representative value that describes the entire mass of data given in the frequency distribution. This single representative value is called the central value, measure of location or average, around which the individual values of a series cluster. This central value or average enables us to get the gist of the entire mass of data, and its value lies somewhere between the two extremes of the given observations. For this reason, such a central value or average is frequently called a measure of central tendency.

From the above discussion, it should be clear to you that the concept of a measure of central tendency is concerned only with quantitative variables and is undefined for qualitative variables, as these are immeasurable on a scale. The first step in looking at data is to describe the data at hand in some concise way. In smaller studies this step can be accomplished by listing each data point. In practice, however, this procedure is tedious or impossible and, even if it were possible, would not give an overall picture of what the data look like. When we want to compare groups of numbers, it is good to have a single value that is considered a good representative of each group. This single value is called the average of the group. Averages are also called measures of central tendency, and they tell us where the center of the distribution of data is located on the scale we are using. There are several such measures, but here we shall discuss the most commonly used ones: the mean, median and mode. There are also other measures of location (sometimes called measures of non-central location), such as quartiles, deciles and percentiles.


Objectives

Since the number of sample points is frequently large, and it is easy to lose track of the overall picture by looking at all the data at once, the data must be summarized as briefly as possible. Thus, the most important objective is to determine a single value that represents the entire mass of data. In addition, measures of central tendency serve the following objectives:

1. To comprehend (understand) the data easily.
2. To facilitate comparison.
3. To make further statistical analysis possible.

Before attempting the measures of central tendency, let us look at the summation notation, which is used frequently.

3.2. The Summation Notation

Let $X_1, X_2, X_3, \ldots, X_n$ be a set of measurements, where n is the total number of observations and $X_i$ is the i-th observation. Very often in statistics an algebraic expression of the form $X_1 + X_2 + X_3 + \cdots + X_n$ is used in a formula to compute a statistic. It is tedious to write an expression like this very often, so mathematicians have developed a shorthand notation to represent a sum of scores, called the summation notation.

Notation: $\Sigma$, read as “sigma” (the Greek capital letter for S), means “the sum of”. Suppose n values of a variable are denoted by $X_1, X_2, X_3, \ldots, X_n$. Then

$$\sum_{i=1}^{n} X_i = X_1 + X_2 + X_3 + \cdots + X_n,$$

where the subscript i ranges from 1 to n and identifies the position of an element.

Example: Suppose the following were the scores obtained on the first assignment for Stat 173 by five first-year Sport Science summer students: 5, 7, 7, 6 and 8. For this set of five numbers, where n = 5, the summation is written $\sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 7 + 7 + 6 + 8 = 33$.


Properties of Summation

1. $\sum (X_i \pm Y_i) = \sum X_i \pm \sum Y_i$, where the number of X values equals the number of Y values.

2. $\sum k X_i = k \sum X_i$, where k is a constant.

3. $\sum_{i=1}^{n} k = nk$, where k is a constant and the summation ranges from 1 to n.
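
These rules can be checked numerically. In the sketch below, X holds the five assignment scores from the earlier example, while Y and k are hypothetical values chosen only for the check: the sum of sums (or differences) equals the sum (or difference) of the sums, a constant factor can be taken outside the sum, and summing a constant n times gives n times the constant.

```python
X = [5, 7, 7, 6, 8]   # scores from the summation example
Y = [2, 4, 1, 3, 5]   # hypothetical second set of values, same length as X
k = 3                 # hypothetical constant
n = len(X)

total = sum(X)

# Sum of (X + Y) and of (X - Y) versus sum of X plus/minus sum of Y.
property_1 = (sum(x + y for x, y in zip(X, Y)) == sum(X) + sum(Y)
              and sum(x - y for x, y in zip(X, Y)) == sum(X) - sum(Y))

# A constant factor can be taken outside the summation.
property_2 = sum(k * x for x in X) == k * sum(X)

# Summing the constant k over n terms gives n * k.
property_3 = sum(k for _ in range(n)) == n * k

print(total, property_1, property_2, property_3)
```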

Important Characteristics of a Good Average

If an average is a good representative of the data, it is said to be a typical average; if it is not a good representative (only a theoretical value), it is said to be a descriptive average.

A typical average should possess the following properties:

• It should be rigidly defined.
• It should be based on all observations under investigation.
• It should be affected as little as possible by extreme observations.
• It should be capable of further algebraic treatment.
• It should be affected as little as possible by fluctuations of sampling.
• It should be easy to calculate and simple to understand.

3.3. Types of measures of central tendency

There are several different measures of central tendency; each has its own advantages and disadvantages. Among them are:

• Mean (arithmetic, weighted, geometric and harmonic)
• Mode
• Median
• Quantiles (quartiles, deciles and percentiles)

The choice among these averages depends on which best fits the property under discussion.


3.3.1. Mean

The various methods of determining the actual value at which the data tend to concentrate are
called measures of central Tendency or averages or mean. Hence, an average is a value which
tends to sum up or describe the mass of the data. There are four types of Means which are
suitable for a particular type of data. These are Arithmetic Mean, Weighted Arithmetic Mean,
Geometric Mean and Harmonic Mean.

The Arithmetic Mean (or simply the mean) is the most popular, most widely used and best-understood measure of central tendency for a set of quantitative data. The arithmetic mean of a set of observations is defined as the sum of the values of all observations in a series divided by the number of items in the series. Suppose $X_1, X_2, X_3, \ldots, X_n$ are n observed values in a sample of size n from a population of size N, n < N. Then the arithmetic mean for ungrouped data of the sample, denoted by $\bar{X}$, is given by:

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}.$$

If we take an entire population, the mean is denoted by $\mu$ and is given by:

$$\mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N},$$

where N stands for the total number of observations in the population.

Example: Suppose a sample consists of the birth weights (in grams) of all live-born infants at a private hospital in a certain city during a 1-week period. The sample values are: 3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200, 3609, 3314, 3484, 3031, 2838, 3101, 4146, 2069, 3541 and 2834. Find the arithmetic mean of the sample birth weights.

Solution: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{1}{20}(3265 + 3323 + \cdots + 2834) = \frac{63338}{20} = 3166.9$ g.
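
The same calculation can be verified in Python, either directly from the definition (sum divided by count) or with the standard library's `statistics.mean`; both are shown in this short sketch.

```python
from statistics import mean

# Birth weights (in grams) of the 20 infants in the sample.
birth_weights = [3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200,
                 3609, 3314, 3484, 3031, 2838, 3101, 4146, 2069, 3541, 2834]

n = len(birth_weights)
total = sum(birth_weights)

# Arithmetic mean from the definition, and from the standard library.
sample_mean = total / n
library_mean = mean(birth_weights)

print(n, total, sample_mean)
```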

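If X is a variable taking the values $X_1, X_2, \ldots, X_m$ with frequencies $f_1, f_2, \ldots, f_m$ respectively, then its arithmetic mean is given by:

$$\bar{X} = \frac{f_1 X_1 + f_2 X_2 + \cdots + f_m X_m}{f_1 + f_2 + \cdots + f_m} = \frac{\sum_{i=1}^{m} f_i X_i}{\sum_{i=1}^{m} f_i}.$$

Example: Suppose the X values are 3, 5, 4, 2, 7 and 6 with corresponding frequencies of 2, 1, 3, 2, 1 and 1 respectively. Find the mean of these frequency data.

Solution: $\bar{X} = \frac{\sum f_i X_i}{\sum f_i} = \frac{2(3) + 1(5) + 3(4) + 2(2) + 1(7) + 1(6)}{2 + 1 + 3 + 2 + 1 + 1} = \frac{40}{10} = 4$.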

Mean for Grouped Data

This method is applicable where the entire range of observations has been grouped into a continuous frequency distribution (grouped frequency distribution). The value or score (X) of each observation is assumed to be identical to the mid-point ($m_i$) of the class interval to which it belongs. In such cases the mean of the distribution is computed as:

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i},$$

where

• k is the number of classes,
• $m_i$ is the midpoint of the i-th class, and
• $f_i$ is the frequency of the i-th class.

To compute the mean of grouped data, follow these steps:

• First, find the class marks.
• Find the product of the frequency and the class mark for each class.
• Find the mean using the formula given above.


Example: Calculate the mean for grouped data on the amount of time (in hours) that 80
college students devoted to leisure (vacation) activities during a typical school week given
below:

Time spent (hours) Frequency

10 – 14 8

15 – 19 28

20 – 24 27

25 – 29 12

30 – 34 3

35 – 39 1

40 – 44 1

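Solution: The class marks of the distribution are 12, 17, 22, 27, 32, 37 and 42.

Then the mean of the data is computed as:

$$\bar{X} = \frac{\sum f_i m_i}{\sum f_i} = \frac{8(12) + 28(17) + 27(22) + 12(27) + 3(32) + 1(37) + 1(42)}{80} = \frac{1665}{80} \approx 20.8 \text{ hours}.$$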
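
The grouped-data calculation can be reproduced step by step in Python. This is a sketch only; the function name `grouped_mean` is illustrative, and the class mark is taken as the midpoint of the class limits.

```python
# Leisure time (hours) of 80 college students: class limits and frequencies.
class_limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
frequencies = [8, 28, 27, 12, 3, 1, 1]

def grouped_mean(class_limits, frequencies):
    """Mean of grouped data: sum of f_i * m_i divided by the sum of f_i."""
    # Step 1: class marks (midpoints of the class limits).
    class_marks = [(lower + upper) / 2 for lower, upper in class_limits]
    # Step 2: products of frequency and class mark, added up.
    weighted_sum = sum(f * m for f, m in zip(frequencies, class_marks))
    # Step 3: divide by the total frequency.
    return weighted_sum / sum(frequencies)

mean_hours = grouped_mean(class_limits, frequencies)
print(f"{mean_hours:.4f} hours")  # 1665 / 80 = 20.8125
```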

Special Properties of the Arithmetic Mean

i) Characteristics:

1) The sum of the deviations about the mean is zero, i.e., $\sum_{i=1}^{n} (X_i - \bar{X}) = 0$.

2) The sum of the squares of deviations from the arithmetic mean is less than the sum of the squares of deviations about any other value, i.e., $\sum_{i=1}^{n} (X_i - \bar{X})^2 < \sum_{i=1}^{n} (X_i - A)^2$ for any $A \neq \bar{X}$.

3) If $\bar{X}_1$ and $\bar{X}_2$ are the means of two groups having the same unit of measurement of a variable, based on $n_1$ and $n_2$ observations respectively, the mean of the combined groups ($\bar{X}_c$) is given by:

$$\bar{X}_c = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}.$$

This extends to more than two groups having the same unit of measurement of a variable.

Example: If the mean mark of one class of 50 students is 30 and the mean mark of another class of 100 students is 40, what is the mean of all 150 students?

Solution: Based on the above formula, it is (50*30 + 100*40)/(50 + 100) = 5500/150 ≈ 36.7.
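
The combined mean is a size-weighted average of the group means, and the same rule applies to any number of groups. A minimal sketch, with `combined_mean` as an illustrative function name:

```python
def combined_mean(groups):
    """Combined mean of several groups given as (size, mean) pairs in the same unit."""
    groups = list(groups)
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n

# Two classes: 50 students with mean 30 and 100 students with mean 40.
all_students_mean = combined_mean([(50, 30), (100, 40)])
print(round(all_students_mean, 1))
```

Note that the result is closer to 40 than to 30 because the second class is twice as large as the first.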

4) If a wrong figure has been used when calculating the mean, then the correct mean can be obtained without repeating the whole process using:

$$\bar{X}_{\text{correct}} = \frac{n \bar{X}_{\text{wrong}} - \text{wrong value} + \text{correct value}}{n},$$

where n is the total number of observations.

Example: The average weight of 10 students was calculated to be 65 kg. Later it was discovered that one weight had been misread as 40 kg instead of 80 kg. Calculate the correct average weight.

Solution: $\bar{X}_{\text{correct}} = \frac{10(65) - 40 + 80}{10} = \frac{690}{10} = 69$ kg.

5) The effect of transforming the original series on the mean:

a) If a constant k is added to (or subtracted from) every observation, then the new mean will be the old mean plus (or minus) k.
b) If every observation is multiplied by a constant k, then the new mean will be k times the old mean.
c) If every observation is divided by a nonzero constant k, then the new mean will be 1/k times the old mean.

Example: The mean of n Tetracycline capsules $X_1, X_2, \ldots, X_n$ is known to be 12 gm. A new set of capsules of another drug is obtained by the linear transformation $Y_i = 2X_i - 0.5$, i = 1, 2, …, n. What will be the mean of the new set of capsules?

Solution: New mean = 2 × old mean − 0.5 = 2 × 12 − 0.5 = 23.5.
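
The effect of a linear transformation on the mean can be confirmed with any data set whose mean is 12. The five values below are hypothetical and chosen only so that their mean is 12; they are not the capsule data.

```python
from statistics import mean

# Hypothetical observations with mean 12 (illustration only).
x = [10.0, 11.5, 12.0, 12.5, 14.0]

# Apply the linear transformation Y = 2X - 0.5 to every observation.
y = [2 * xi - 0.5 for xi in x]

old_mean = mean(x)
new_mean = mean(y)

# The mean of the transformed data equals the transformed mean.
print(old_mean, new_mean, 2 * old_mean - 0.5)
```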


Exercise: The mean of a set of numbers is 500.

a. If 10 is added to each of the numbers in the set, then what will be the mean of the new set?
b. If each of the numbers in the set is multiplied by -5, then what will be the mean of the new
set?

ii) Advantages:

• It is based on all values given in the distribution.
• It is most easily understood.
• It is most amenable to algebraic treatment.
• It is a stable average, i.e., it is relatively unaffected by fluctuations of sampling.
• It is suitable for further mathematical treatment.

iii) Disadvantages:

• It is affected by extreme observations.
• It cannot be used in the case of open-end classes.
• It cannot be determined by the method of inspection.
• It cannot be used when dealing with qualitative characteristics, such as intelligence, honesty,
beauty.
• It can be a number which does not exist in the series.
• Sometimes it leads to a wrong conclusion if the details of the data from which it is obtained are
not available.

Weighted Mean

In the computation of the mean we have given equal importance to each observation. However, when
averaging quantities, it is often necessary to account for the fact that not all of them are
equally important in the phenomenon being described. In order to give the quantities being
averaged their proper degree of importance, it is necessary to assign them relative importance
called weights, and then calculate a weighted mean.


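In general, the weighted mean $\bar{X}_w$ of a set of values X1, X2, …, Xn, whose relative importance
is expressed numerically by a corresponding set of weights W1, W2, …, Wn, is given by:

$\bar{X}_w = \frac{\sum_{i=1}^{n} W_iX_i}{\sum_{i=1}^{n} W_i} = \frac{W_1X_1 + W_2X_2 + … + W_nX_n}{W_1 + W_2 + … + W_n}$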

Example: A student obtained the following percentages in an examination: English 60,
Management 75, Mathematics 63, Economics 59, and Statistics 55. Find the student's weighted
arithmetic mean if weights 1, 2, 1, 3, 3 respectively are allotted to the subjects.

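Solution:

$\bar{X}_w = \frac{\sum W_iX_i}{\sum W_i}$ = (60*1 + 75*2 + 63*1 + 59*3 + 55*3)/(1 + 2 + 1 + 3 + 3) = 615/10 = 61.5.

Note that the weighted total must be divided by the sum of the weights. Dividing it by the number
of subjects instead gives 615/5 = 123, which is impossible for a percentage mark and clearly wrong.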
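As a quick check of the weighted mean example, the following Python sketch recomputes the result. The helper name `weighted_mean` is an illustrative choice, and the plain (unweighted) mean is printed only for comparison.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


marks = [60, 75, 63, 59, 55]   # English, Management, Mathematics, Economics, Statistics
weights = [1, 2, 1, 3, 3]

print(weighted_mean(marks, weights))  # 61.5
print(sum(marks) / len(marks))        # 62.4 (unweighted mean, for comparison)
```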

Geometric Mean
If the observed values are measured as ratios, proportions or percentages, or the series of
observations contains one or more unusually large values, the geometric mean gives a better
measure of central tendency than other means.

It is obtained by taking the nth root of the product of the n values, i.e., if the values of the
observations are denoted by X1, X2, …, Xn, then

$GM = \sqrt[n]{X_1 X_2 \cdots X_n}$.

When the values X1, X2, …, Xk occur with frequencies f1, f2, …, fk, then
$GM = \sqrt[n]{X_1^{f_1} X_2^{f_2} \cdots X_k^{f_k}}$, where $n = \sum f_i$.

Whenever the frequency distribution is grouped (continuous), the class marks (mi) of the class
intervals are taken as Xi and the second formula above can be used.


But if n is a large number, computing the nth root of the product of these values
by simple arithmetic is tedious. To facilitate the computation of the geometric mean we
make use of logarithms. The formula

$GM = \sqrt[n]{X_1 X_2 \cdots X_n} = (X_1 X_2 \cdots X_n)^{1/n}$

reduces in its logarithmic form to

$\log GM = \frac{1}{n}(\log X_1 + \log X_2 + … + \log X_n) = \frac{1}{n}\sum_{i=1}^{n} \log X_i$.

The logarithm of the geometric mean is equal to the arithmetic mean of the logarithms of
individual values. The actual process involves obtaining logarithm of each value, adding them
and dividing the sum by the number of observations. The quotient so obtained is then looked
up in the tables of anti-logarithms which will give us the geometric mean.

Example: The geometric mean may be calculated for the following parasite counts per 100
fields of thick films that were taken from 42 samples:

7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25, 12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50,
21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237

Solution: $GM = \sqrt[42]{7 \times 8 \times 3 \times … \times 237}$ => log GM = 1/42 (log 7 + log 8 + log 3 + … + log 237)

= 1/42 (0.8451 + 0.9031 + 0.4771 + … + 2.3747) = 1/42 (41.9985) = 0.9999 ≈ 1.0000.

The anti-log of 0.9999 is 9.9992 ≈ 10 and this is the required geometric mean.

By contrast, the arithmetic mean, which is inflated by the high values of 440, 237 and 200 is
39.8
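The logarithmic procedure described above can be sketched in Python as follows. This is a minimal illustration using base-10 logarithms, as in the worked solution; the function name is an assumption, and tiny differences from the hand calculation come from rounding in the log tables.

```python
import math

# Parasite counts per 100 fields from the example (42 samples).
counts = [7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25, 12, 6, 9, 2, 1, 6, 7,
          3, 4, 70, 20, 200, 2, 50, 21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20,
          90, 1, 237]


def geometric_mean(values):
    """Anti-log of the arithmetic mean of the logarithms."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


gm = geometric_mean(counts)
am = sum(counts) / len(counts)
print(round(gm, 1), round(am, 1))  # roughly 10 versus 39.8
```

The arithmetic mean is pulled up by the few very large counts, while the geometric mean stays near the bulk of the data.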

Properties of Geometric Mean

i) Characteristics

1. It is a calculated value and depends upon the size of all the items.

2. It gives less importance to extreme items than does the arithmetic mean.

3. For any series of positive items it is never larger than the arithmetic mean (the two are equal only when all items are equal).

4. It exists ordinarily only for positive values.



ii) Advantages

1) Since it is less affected by extremes, it is a more preferable average than the arithmetic mean.

2) It is capable of algebraic treatment.

3) It is based on all values given in the distribution.

iii) Disadvantages

1) Its computation is relatively difficult.

2) It cannot be determined if there is any negative value in the distribution, or where one of the
items has a zero value.

Harmonic mean (HM)

It is a suitable measure of central tendency when the data pertain to speeds, rates and time.
The harmonic mean is defined as the reciprocal of the mean of the reciprocals of a series of
observations. That is, let X1, X2, …, Xn be the values of a set of observations, then the
harmonic mean is given by:

$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}} = \frac{n}{\frac{1}{X_1} + \frac{1}{X_2} + … + \frac{1}{X_n}}$.

When the observed values X1, X2, …, Xk have the corresponding frequencies f1, f2, …, fk
respectively, then the harmonic mean is given by:

$HM = \frac{n}{\sum_{i=1}^{k} \frac{f_i}{X_i}}$, where $\sum_{i=1}^{k} f_i = n$.

When the frequency distribution is grouped (continuous), the class marks (mi) of the class
intervals are taken as Xi in the frequency formula above.

Example: Milk is sold at a place at the rates of 1.8, 2, 2.25 and 2.5 birr per liter in four
different months. Assuming equal amount of money is spent on milk by a family in the four
months; find the average price paid per liter using harmonic mean.

Solution: HM = 4/(1/1.8 + 1/2 + 1/2.25 + 1/2.5) = 4/1.9 ≈ 2.11 birr.


Note that if all the observations are positive, we have the relationship among the three means
given as AM >= GM >= HM and all these three means are equal if all positive valued
observations are equal.
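The milk-price example and the ordering AM >= GM >= HM can be verified with a few lines of Python. This is only an illustrative sketch; `harmonic_mean` is a hypothetical helper name, and `math.prod` requires Python 3.8 or later.

```python
import math


def harmonic_mean(values):
    """Reciprocal of the mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)


prices = [1.8, 2, 2.25, 2.5]  # birr per liter in four months

hm = harmonic_mean(prices)
am = sum(prices) / len(prices)
gm = math.prod(prices) ** (1 / len(prices))

print(round(hm, 2))    # 2.11
print(am >= gm >= hm)  # True
```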

3.3.2. Mode

The mode is the value of the observation that occurs with the greatest frequency. A particular
disadvantage is that, with a small number of observations, there may be no mode. In addition,
sometimes, there may be more than one mode such as when dealing with a bimodal (two-
peak) distribution. It is even less amenable (responsive) to mathematical treatment than the
median.

Example: Find the mode for the following data:
(a) 22, 66, 69, 70, 73 (no modal value).
(b) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg).
(c) 10, 10, 9, 9, 8, 12, 15, 5 (modal values = 9 and 10).
Hence, it is possible for a frequency distribution to have more than one mode.

Note: Distributions with one mode are called unimodal, those with two modes are called
bimodal, and those with more than two modes are called multimodal.

Modal Value for Grouped Data

To find the modal value for a grouped (continuous) frequency distribution, first find the modal
class, which is the class that contains the mode, i.e., the class with the highest frequency.
Then, to compute the modal value for grouped data, we use the formula:

$\hat{X} = L + \frac{\Delta_1}{\Delta_1 + \Delta_2} \cdot w$, where

L = lower class boundary of the modal class;

w = the class width of the modal class;

$\Delta_1 = f_{mo} - f_1$ and $\Delta_2 = f_{mo} - f_2$;

$f_{mo}$ = frequency of the modal class;

$f_1$ = frequency of the class immediately preceding the modal class;

$f_2$ = frequency of the class immediately succeeding the modal class.

Note: The modal class is a class interval with the highest frequency.

Example: Consider the above data set for the amount of time (in hours) that 80 college
students devoted to leisure activities during a typical school week and calculate the modal
value of the data.

Solution: The modal class is the second class, as its frequency (28) is the highest
among all classes. Therefore,

$f_{mo}$ = 28 (frequency of the modal class)

$f_1$ = 8 (frequency of the class immediately preceding the modal class)

$f_2$ = 27 (frequency of the class immediately succeeding the modal class)

L = 14.5 (lower class boundary of the modal class)

w = 5 (the class width of the modal class)

$\Delta_1 = f_{mo} - f_1 = 28 - 8 = 20$ and $\Delta_2 = f_{mo} - f_2 = 28 - 27 = 1$

Mode = 14.5 + (20/(20 + 1))*5 = 14.5 + 4.76 = 19.26.

Properties of the Mode

i) Characteristics

1. It is an average of position
2. It is not affected by extreme values
3. It is the most typical value of the distribution


ii) Advantages

1. It is not affected by extreme observations.


2. Easy to calculate and simple to understand.
3. It can be calculated for distribution with open end class

iii) Disadvantages

1. It is not based on all observations


2. It is not suitable for further mathematical treatment.
3. Often its value is not unique.

Note: being the point of maximum density, mode is especially useful in finding the most popular
size in studies relating to marketing, trade, business, and industry. It is the appropriate average
to be used to find the ideal size.

3.3.3. The Median and Other measures of Locations

I. The Median

An alternative measure of location, perhaps second in popularity to the arithmetic mean, is the
median. In a distribution, median is the value of the variable which divides it in to two equal
halves. In an ordered series of data median is an observation lying exactly in the middle of
the series. It is the middle most value in the sense that the number of values less than the
median is equal to the number of values greater than it.

Suppose there are n observations in a sample. If these observations are ordered from
smallest to largest, then the sample median for ungrouped data is defined as:

(1) The $\left(\frac{n+1}{2}\right)^{th}$ observation (value) if n is odd

(2) The average of the $\left(\frac{n}{2}\right)^{th}$ and $\left(\frac{n}{2} + 1\right)^{th}$ observations (values) if n is even.

The rationale for these definitions is to ensure an equal number of sample points on both sides
of the sample median. The median is defined differently when n is even and odd because it is
impossible to achieve this goal with one uniform definition. For samples with an odd sample
size, there is a unique central point; for example, for a sample of size 7, the fourth largest point
is the central point in the sense that 3 points are smaller than it and 3 points are larger. For samples
with an even size, there is no unique central point and the middle 2 values must be averaged.
Thus, for a sample of size 8, the fourth and the fifth largest points would be averaged to obtain
the median, since neither is the central point.

Example: Find the median of the following numbers.

(a) 5, 2, 8, 9, 4. (b) 2, 1, 8, 3, 5, 8.
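The odd/even rule for the median of ungrouped data can be sketched in Python as below and applied to the two data sets of the example. The function is an illustration of the definition, not a prescribed implementation.

```python
def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the ((n + 1) / 2)-th observation
    return (ordered[mid - 1] + ordered[mid]) / 2   # average of the two middle values


print(median([5, 2, 8, 9, 4]))     # 5
print(median([2, 1, 8, 3, 5, 8]))  # 4.0
```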

Median for Grouped Data

For a grouped (continuous) frequency distribution, the median is calculated as:

$\tilde{X} = L + \frac{w}{f}\left(\frac{n}{2} - CF\right)$, where

L = true lower limit (lower class boundary) of the median class

w = length of the interval

n = total frequency of the sample

CF = Cumulative frequency preceding the median class.

½n = Number of observations to be counted off from one end of the distribution to reach the
median and

f = Frequency of that interval containing the median.

To calculate median we have the following steps.

i. Construct the less than cumulative frequency column of the table.

ii. Find the median class. For this, divide the total number of observations by 2 and then search for
the smallest cumulative frequency which is greater than or equal to n/2.

iii. Use the formula given above.

NB: The median of grouped data is also the value of X on the horizontal axis which
corresponds to the intersection point of the less than ogive and the more than ogive (they
intersect at a height of n/2) if they are drawn in the same plane.

Example: consider the above data set for the amount of time (in hours) that 80 college
students devoted to leisure activities during a typical school week and calculate the median of
the data.

Solution: n/2 = 80/2 =40. The class interval that contains the 40th observation from the less
than cumulative distribution is the third-class interval.

Therefore, n/2 = 40, L= 19.5, CF = 36, f=27 and w = 5.

Median = 19.5 + (40 - 36)*5/27 = 19.5 + 20/27 ≈ 20.241.
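A minimal Python sketch of the grouped-median formula, assuming the median class has already been located from the cumulative frequencies (parameter names are illustrative):

```python
def grouped_median(lower_boundary, width, n, cf_before, f_median):
    """Median of grouped data: L + (n/2 - CF) * w / f."""
    return lower_boundary + (n / 2 - cf_before) * width / f_median


# Leisure-time example: L = 19.5, w = 5, n = 80, CF = 36, f = 27.
print(round(grouped_median(19.5, 5, 80, 36, 27), 3))  # 20.241
```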

Properties of the Median

i) Characteristics

1) It is an average of position.

2) It is affected more by the number of items than by extreme values.

ii) Advantages

1) It is easily calculated and is not much disturbed by extreme values

2) It is more typical of the series

3) The median may be located even when the data are incomplete, e.g, when the class intervals
are irregular and the final classes have open ends.

iii) Disadvantages


1. The median is not as well suited to algebraic treatment as the arithmetic, geometric and
harmonic means.
2. It is not as generally familiar as the arithmetic mean.

Selecting from the measure of central tendency

Selecting among the mean, the median and the modal value is based on the following
principles (these are not hard and fast rules).

Which measure of central tendency is appropriate?

1. Is the data categorical? If yes, use the modal value. If no, go to question 2.

2. Is the total of interest? If yes, use the mean. If no, go to question 3.

3. Is the distribution skewed? If yes, use the median. If no, use the mean.


Empirical Relationship between Mean, Median and mode

In the case of a symmetrical distribution, the mean, median and mode coincide. However, for a
moderately asymmetrical (nonsymmetrical) distribution, the mean and mode usually lie on the
two ends and the median lies between them, and they have the following important empirical
relationship: mean – mode = 3(mean – median); thus, mode = 3*median – 2*mean.

Example: In a moderately asymmetrical distribution, the mean and the median are 20 and 25
respectively. Find the mode of the distribution.

Solution: mean – mode = 3(mean – median) => mode = 3*median – 2*mean = 3*25 – 2*20 =
35.

3.4. Other measures of Location (Quantiles)

When a distribution is arranged in order of magnitude of items, the median is the value of the
middle term. Other measures that depend on position in the distribution, namely quartiles,
deciles, and percentiles, are collectively called quantiles.

Quartiles: Quartiles are measures that divide the frequency distribution into four equal
parts. The values of the variable corresponding to these divisions are denoted Q1, Q2, and Q3,
often called the first, the second and the third quartile respectively.

Q1 is a value which has 25% of the items less than or equal to it. Similarly, Q2 has 50% of the
items with values less than or equal to it and Q3 has 75% of the items with values less than or
equal to it.

The kth quartile Qk for ungrouped data is the value of the item which is the

position, where k = 1, 2, 3 and n is the total number of observations.

The computation of the three quartiles for grouped data can be done as follows:

• Calculate kn/4 and search for the minimum cumulative frequency which is greater than or
equal to kn/4, k = 1, 2, 3.

• The class corresponding to this cumulative frequency is the kth quartile class. This is the class
where Qk lies.

• Thus, $Q_k = L + \frac{C}{f}\left(\frac{kn}{4} - CF\right)$, k = 1, 2, 3, where

L = lower class boundary of the kth quartile class

n = the total number of observations

CF = the less than cumulative frequency corresponding to the class immediately preceding the
kth quartile class

C = the class width of the kth quartile class

f = frequency of the kth quartile class

Deciles

Deciles are measures that divide the frequency distribution into ten equal parts. The values of
the variable corresponding to these divisions are denoted D1, D2, …, D9, often called the first,
the second, …, the ninth decile respectively.

To find Di (i = 1, 2, …, 9) we count i*n/10 of the observations beginning from the lowest class.

For grouped data we have the following formula:

$D_k = L + \frac{C}{f}\left(\frac{kn}{10} - CF\right)$, k = 1, 2, 3, …, 9, where

L = lower class boundary of the kth decile class

n = the total number of observations

CF = the less than cumulative frequency corresponding to the class immediately preceding the
kth decile class

C = the class width of the kth decile class

f = frequency of the kth decile class

Percentiles

Percentiles are measures that divide the frequency distribution into hundred equal parts. The
values of the variable corresponding to these divisions are denoted P1, P2, …, P99, often called
the first, the second, …, the ninety-ninth percentile respectively.

To find Pi (i = 1, 2, …, 99) we count i*n/100 of the observations beginning from the lowest class.

For grouped data we have the following formula:

$P_k = L + \frac{C}{f}\left(\frac{kn}{100} - CF\right)$, k = 1, 2, 3, …, 99, where

L = lower class boundary of the kth percentile class

n = the total number of observations

CF = the less than cumulative frequency corresponding to the class immediately preceding the
kth percentile class

C = the class width of the kth percentile class

f = frequency of the kth percentile class

Note: 1) To compute quantiles, we first sort the data in ascending order.

2) Q2 = D5 = P50 = median, P25 = Q1, P75 = Q3, and $D_i = P_{10i}$, i = 1, 2, 3, …, 9.

3) Quantiles have the advantage of being less sensitive to outliers and of not being much
affected by the sample size (n).
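Because the quartile, decile and percentile formulas for grouped data differ only in the divisor (4, 10 or 100), they can be sketched as one Python function. The function name and its arguments are illustrative assumptions; the class boundaries and frequencies are those of the 100-observation table in Example 4.2 later in this chapter.

```python
def grouped_quantile(k, parts, boundaries, freqs):
    """k-th quantile when the data are split into `parts` equal parts.

    parts = 4 gives quartiles, 10 deciles, 100 percentiles.
    boundaries holds the class boundaries (one more entry than freqs).
    """
    n = sum(freqs)
    target = k * n / parts
    cf = 0  # less than cumulative frequency before the current class
    for i, f in enumerate(freqs):
        if f > 0 and cf + f >= target:
            width = boundaries[i + 1] - boundaries[i]
            return boundaries[i] + (target - cf) * width / f
        cf += f
    raise ValueError("k must satisfy 0 < k < parts")


boundaries = [20.5, 29.5, 38.5, 47.5, 56.5, 65.5, 74.5, 83.5, 92.5, 101.5]
freqs = [4, 12, 16, 23, 19, 14, 9, 3, 0]

q1 = grouped_quantile(1, 4, boundaries, freqs)
q2 = grouped_quantile(2, 4, boundaries, freqs)
q3 = grouped_quantile(3, 4, boundaries, freqs)
print(q1, round(q2, 4), round(q3, 4))  # 43.5625 54.5435 66.1429

# Q2, D5 and P50 all locate the median.
print(q2 == grouped_quantile(5, 10, boundaries, freqs)
         == grouped_quantile(50, 100, boundaries, freqs))  # True
```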


3.5. Statistical Description: Measures of Dispersion


3.5.1. Introduction and Objectives of Measuring Dispersion

The term dispersion is generally used in two senses. Firstly, dispersion refers to the variation
of the items among themselves. If all the items of a series have the same value, there is
no variation among them; the more the items differ from one another, the greater the dispersion. Secondly,
dispersion refers to the variation of the items around an average. If the difference between the
values of the items and the average is large, the dispersion will be high; on the other hand, if the
difference between the values of the items and the average is small, the dispersion will be low.
Thus, dispersion is defined as the scatter or spread of the individual items in a given
series.

The measures of dispersion are helpful in statistical investigation. Some of the main
objectives of dispersion are as under:

1. To determine the reliability of an average: The measures of dispersion help in determining
the reliability of an average. They point out how far an average is representative of a
statistical series. If the dispersion or variation is small, the average will closely represent the
individual values and is highly representative. On the other hand, if the dispersion or
variation is large, the average will be quite unreliable.
2. To compare the variability of two or more series: The measures of dispersion help in
comparing the variability of two or more series. It is also useful to determine the uniformity
or consistency of two or more series. A high degree of variation would mean less consistency
or less uniformity as compared to the data having less variation.
3. For facilitating the use of other statistical measures: Measures of dispersion serve the basis
of many other statistical measures such as correlation, regression, testing of hypothesis etc.
4. Basis of statistical quality control: The measure of dispersion is the basis of statistical
quality control. The extent of the dispersion gives indication to the management as to
whether the variation in the quality of the product is due to random factors or there is some
defect in the manufacturing process.


Consider the following data sets:

Set 1: 60 40 30 50 60 40 70 50 (mean = 50)

Set 2: 50 49 49 51 48 50 53 50 (mean = 50)

Set 3: 50 50 50 50 50 50 50 50 (mean = 50)

The three data sets have a mean of 50, but obviously set 1 is more “spread out” than set 2 and
set 3 has no variability.
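A short Python sketch makes the point concrete: the three sets share the same mean, but a simple spread measure (here the largest value minus the smallest, i.e., the range introduced later in this section) separates them. The dictionary layout is an illustrative choice.

```python
data_sets = {
    "Set 1": [60, 40, 30, 50, 60, 40, 70, 50],
    "Set 2": [50, 49, 49, 51, 48, 50, 53, 50],
    "Set 3": [50, 50, 50, 50, 50, 50, 50, 50],
}

# Same mean everywhere, very different spread.
summary = {
    name: (sum(values) / len(values), max(values) - min(values))
    for name, values in data_sets.items()
}

for name, (mean, spread) in summary.items():
    print(name, "mean =", mean, "largest - smallest =", spread)
```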

Objectives

The general objective of measuring dispersion is to obtain a single summary figure which
adequately exhibits whether the distribution is compact or spread out. The specific objectives are:

• To judge the reliability of measures of central tendency

• To control variability itself.

• To compare two or more groups of numbers in terms of their variability.

• To make further statistical analysis.

3.5.1.1. Absolute and Relative Measures

Measures of dispersion may be either absolute or relative.

1. Absolute measures of dispersion: An absolute measure is expressed in the same statistical unit
in which the original data are given, such as kilograms, tonnes, etc. These measures are suitable
for comparing the variability in two distributions having variables expressed in the same units
and of the same average size. These measures are not suitable for comparing the variability
in two distributions having variables expressed in different units.


2. Relative measures of dispersion: A relative measure of dispersion is the ratio of a measure
of absolute dispersion to an appropriate average or the selected items of the data.

The measures of dispersion which are expressed in terms of the original unit of a series are
termed as absolute measures. Such measures are not suitable for comparing the variability of
two distributions which are expressed in different units of measurement and different average
size. Relative measures of dispersions are a ratio or percentage of a measure of absolute
dispersion to an appropriate measure of central tendency and are thus pure numbers
independent of the units of measurement. For comparing the variability of two distributions
(even if they are not measured in the same unit), we compute the relative measure of
dispersion instead of absolute measures of dispersion.


It is useful for comparing variation in two or more distributions where units of measurements
are the same. Various measures of dispersions are in use. The most commonly used measures
of dispersions are:

1) Range and Relative Range

2) Quartile Deviation and Coefficient of Quartile Deviation

3) Mean Deviation and Coefficient of Mean Deviation

4) Standard Deviation and Coefficient of Variation.

The Range and Relative Range

The Range (R): The range is the largest score minus the smallest score.

Its formula is: R = L – S,

where R = range, L = largest value in the series, S = smallest value in the series.

The relative measure of range, also called the coefficient of range, is defined as:
Coefficient of Range = (L – S)/(L + S).

The following two distributions have the same range, 13, yet appear to differ greatly in the
amount of variability.

Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45

Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45


For this reason, among others, the range is not the most important measure of variability.

Relative Range (RR): It is also sometimes called coefficient of range and given by:

CR = (highest value – smallest value)/(highest value + smallest value)

Example:

1. Find the relative range of the above two distributions. (Exercise!)

2. If the range and relative range of a series are 4 and 0.25 respectively. Then what is the value of:

a) Smallest observation (Ans. 6)

b) Largest observation (Ans. 10)

Example 4.1: five students obtained the following marks in statistics: . Find the
Range and coefficient of range

Solution: Here, L = 35 and S = 15, so Range = L – S = 35 – 15 = 20.

Coefficient of Range = (L – S)/(L + S) = (35 – 15)/(35 + 15) = 20/50 = 0.4

Example 4.1: Find out range and coefficient of range of the following series

Size        5-10   11-15   16-20   21-25   26-30
Frequency   4      9       15      30      40

Solution: Here,


Example 4.2: Find out range and coefficient of range of the following series

Class no.   Class limit   Class boundary   Frequency (F)   Class midpoint (xi)   LCF   MCF
1           21-29         20.5-29.5        4               25                    4     100
2           30-38         29.5-38.5        12              34                    16    96
3           39-47         38.5-47.5        16              43                    32    84
4           48-56         47.5-56.5        23              52                    55    68
5           57-65         56.5-65.5        19              61                    74    45
6           66-74         65.5-74.5        14              70                    88    26
7           75-83         74.5-83.5        9               79                    97    12
8           84-92         83.5-92.5        3               88                    100   3
9           93-101        92.5-101.5       0               97                    100   0
Total                                      100

LCF = less than cumulative frequency; MCF = more than cumulative frequency.


Solution: Here,

L = 92 , S= 21

Range = L – S = 92 – 21 = 71

LS 92  21
Coefficient of Range = = = 0.62832
LS 92  21

The range is a quick and rough measure of variability, although when a test is given back to students
they very often wish to know the range of scores. Because the range is greatly affected by
extreme scores, it may give a distorted picture of the scores. The range for a grouped frequency
distribution is the upper class boundary of the last class interval minus the lower class
boundary of the first class interval, i.e., $R = UCB_{last} - LCB_{first}$.
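The range and the coefficient of range are simple enough to sketch in a few lines of Python. The function names are illustrative; the data are the two distributions shown earlier, and the last line reproduces the coefficient from Example 4.2 using L = 92 and S = 21.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """(L - S) / (L + S), a unit-free relative measure."""
    largest, smallest = max(values), min(values)
    return (largest - smallest) / (largest + smallest)


dist1 = [32, 35, 36, 36, 37, 38, 40, 42, 42, 43, 43, 45]
dist2 = [32, 32, 33, 33, 33, 34, 34, 34, 34, 34, 35, 45]

# Both distributions share the same range even though their spread differs.
print(value_range(dist1), value_range(dist2))     # 13 13
print(round(coefficient_of_range([21, 92]), 5))   # 0.62832
```

Since both measures use only the two extreme values, they cannot distinguish the two distributions, which is the weakness the text points out.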

Merits and Demerits of range

Merits:

• It is rigidly defined.

• It is easy to calculate and simple to understand.

Demerits:

• It is not based on all observation.

• It is highly affected by extreme observations.

• It is affected by fluctuation in sampling.

• It cannot be computed in the case of open end distribution.

• It is very sensitive to the size of the sample.


The quartile deviation and coefficient of quartile deviation

Inter-quartile range and quartile deviation are other measures of dispersion. The difference
between the upper quartile and the lower quartile is called the inter-quartile range.
Symbolically, IQR = Q3 – Q1.

The inter-quartile range covers the dispersion of the middle 50% of the items of the series. Quartile
deviation, also called semi-inter-quartile range, is half of the difference between the upper and
lower quartiles, that is, half of the inter-quartile range. Its formula is: Q.D = (Q3 – Q1)/2.

The relative measure of quartile deviation, also called the coefficient of quartile deviation, is
defined as: C.Q.D = (Q3 – Q1)/(Q3 + Q1).

Example 4.3: Find the inter-quartile range, quartile deviation and coefficient of quartile
deviation from the following data.

28, 18, 20, 24, 27, 27, 30, 15

Solution: First arrange the data in ascending order: 15, 18, 20, 24, 27, 27, 28, 30


Example 4.4: Find inter-quartile range, quartile deviation and coefficient of quartile deviation
from the following data

Marks             2    3    4    5    6    7    8    9
No. of students   10   11   12   13   5    12   7    5

Solution:

Marks   No. of students   CF
2       10                10
3       11                21
4       12                33
5       13                46
6       5                 51
7       12                63
8       7                 70
9       5                 75 = N
Total   N = 75


Percentile range is defined as

IQR and Outlier

Any applied statistician who has analyzed a number of sets of real data is likely to have come
across outliers. The intuitive definition of an outlier would be ‘an observation which deviates so
much from other observations as to arouse suspicions that it was generated by a different
mechanism.’ That is, outliers are observations that are distinct from the main body of the data
and are incompatible with the rest of the data. These values may be genuine observations from
individuals with very extreme levels of the variable.

Remark: Q.D or C.Q.D includes only the middle 50% of the observation.

IQR Rule

A simple approach to detecting outliers is to print the data and check them visually. This
is suitable if the number of observations is not too large and if the potential outlier is much
lower or higher than the rest of the data.

When the number of observations gets larger, we can check for the presence of outliers
by the 1.5 IQR rule. The steps to identify outliers are as follows:

1. Arrange the data in ascending order.
2. Calculate the first and third quartiles and the inter-quartile range.
3. Compute Q1 – 1.5×IQR and Q3 + 1.5×IQR; any observation outside this range is considered an
outlier.


Example 4.5: Consider the following data

1 2 5 5 7 8 10 11 11 12 15 25

Check the presence of outliers.

Solution: The data are already arranged in ascending order. The first and third quartiles are
Q1 = 5 and Q3 = 11, so IQR = Q3 – Q1 = 6.

The next step is computing the limits: Q1 – 1.5×IQR = 5 – 9 = –4 and Q3 + 1.5×IQR = 11 + 9 = 20.

Therefore, observations less than –4 or greater than 20 are considered outliers. That is,
25 is an outlier.
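The 1.5 IQR rule can be sketched in Python as follows. One assumption is made loudly here: the kth quartile is taken as the observation at position ceil(k*n/4) in the ordered data. This convention is chosen only because it reproduces the limits –4 and 20 of Example 4.5; other quartile conventions give slightly different limits, although 25 is flagged in this data set either way.

```python
import math


def quartile_by_position(ordered, k):
    """k-th quartile as the ceil(k * n / 4)-th ordered value (assumed convention)."""
    position = math.ceil(k * len(ordered) / 4)  # 1-based position
    return ordered[position - 1]


def iqr_outliers(values):
    """Return the lower limit, the upper limit, and the observations outside them."""
    ordered = sorted(values)
    q1 = quartile_by_position(ordered, 1)
    q3 = quartile_by_position(ordered, 3)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower or x > upper]
    return lower, upper, outliers


data = [1, 2, 5, 5, 7, 8, 10, 11, 11, 12, 15, 25]
print(iqr_outliers(data))  # (-4.0, 20.0, [25])
```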

Using the 1.5 IQR rule, we can draw a box plot, which displays the five-number
summary. The five-number summary contains the minimum, the first quartile, the median, the third quartile
and the maximum.

The steps to draw a box plot are:

1. Order the data; this must be done before you can find the five-number summary.
2. Find the median first. It is the middle point.
3. Then find the quartiles, Q1 and Q3, and the 1.5 IQR outlier limits.
4. Draw a "box" from Q1 to Q3 with bars at Q1, Q3 and the median. (The box may be drawn
horizontally or vertically.)
5. Draw a straight line from Q3 to either the largest observation or the upper
outlier bound, whichever is smaller.
6. Draw a straight line from Q1 to either the smallest observation or the lower
outlier bound, whichever is larger.
7. Any remaining observations (the outliers) are shown as individual points on the plot.

Exercise: Take the data 1, 2, 5, 5, 7, 8, 10, 11, 12, 12, 18, 25 and draw box plot

Merits of quartile deviation

• It is simple to understand.
• It is easy to compute.
• It is well-defined.
• It helps in studying the middle 50% of the items in the series.
• It is not affected by the extreme items.
• It is useful in the case of open-ended distributions.

Demerits of quartile deviation

• It is not based on all the items.
• It is not capable of further algebraic treatment.
• It does not have sampling stability.

The mean deviation and coefficient of mean deviation

The Mean Deviation (M.D): The mean deviation of a set of items is defined as the arithmetic
mean of the values of the absolute deviations from a given average. Depending up on the type
of averages used we have different mean deviations.

a) Mean Deviation about the mean

$MD = \frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n}$.


For the case of frequency distribution data where the values X1, X2, X3, …, Xm occur f1, f2,
f3, …, fm times respectively, the mean deviation is obtained by:

$MD = \frac{\sum_{i=1}^{m} f_i|X_i - \bar{X}|}{\sum_{i=1}^{m} f_i}$.

For grouped data, that is, if the data are given in the form of a frequency distribution of k classes
in which mi and fi are the class mark and frequency of the ith class respectively, the mean
deviation is given by:

$MD = \frac{\sum_{i=1}^{k} f_i|m_i - \bar{X}|}{\sum_{i=1}^{k} f_i}$.

b) Mean deviation from median = $\frac{1}{n}\sum |x_i - Md|$

c) Mean deviation from mode = $\frac{1}{n}\sum |x_i - Mode|$

In the case of a frequency distribution:

b') Mean deviation from median = $\frac{1}{n}\sum f_i|x_i - Md|$

c') Mean deviation from mode = $\frac{1}{n}\sum f_i|x_i - Mode|$

Steps to calculate M.D:

1. Find the arithmetic mean, $\bar{X}$.

2. Find the deviation of each reading from $\bar{X}$.

3. Find the arithmetic mean of the deviations, ignoring sign.

Example: calculate the mean deviation for the following data:


Xi   10   8   9    7   6
fi   8    9   13   6   3

Solution: First find the mean as $\bar{X} = \frac{\sum f_iX_i}{\sum f_i}$ = (10*8 + 8*9 + … + 6*3)/(8 + 9 + … + 3) = 329/39 ≈ 8.4,
then

Xi                   10     8     9     7     6
fi                   8      9     13    6     3
|Xi – $\bar{X}$|     1.6    0.4   0.6   1.4   2.4
fi |Xi – $\bar{X}$|  12.8   3.6   7.8   8.4   7.2

Thus, $MD = \frac{\sum f_i|X_i - \bar{X}|}{\sum f_i}$ = (12.8 + … + 7.2)/(8 + … + 3) = 39.8/39 = 1.02.

Interpretation: each value deviates on average 1.02 from the arithmetic mean, 8.4.
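The mean deviation for frequency data can be sketched in Python as below. The function name is illustrative. Note that the code uses the unrounded mean (329/39), so the result agrees with the hand calculation only after rounding to two decimals.

```python
def mean_deviation(values, freqs):
    """Frequency-weighted mean of absolute deviations from the arithmetic mean."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    return sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n


x = [10, 8, 9, 7, 6]
f = [8, 9, 13, 6, 3]
print(round(mean_deviation(x, f), 2))  # 1.02
```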

Note: You can also calculate the mean deviation about the Median and Mode.

Coefficient of Mean Deviation (C.M.D):

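$CMD = \frac{MD \text{ about an average}}{\text{that average}}$, where the average may be the mean, the median or the mode.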

Exercise: Find all coefficients of mean deviations for the following frequency distribution:

Marks             10-20   21-30   31-40   41-50   51-60
No. of students   4       8       20      12      6

Merits of mean deviation

• It is simple to understand.
• It is easy to compute.
• It is well-defined.
• It is based on all observations.
• It is not unduly affected by the extreme items.
• It can be calculated by using any average.

Demerits of mean deviation

• It is not capable of further algebraic treatment.
• It does not take into account the signs of the deviations of items from the average.

Note that: of all the mean deviations taken about different averages or any arbitrary value,
the mean deviation about the median has the smallest value.

The Variance, Standard Deviation and the Coefficient of Variation

The Variance: The variance is the "average squared deviation from the mean", i.e., it measures the average
of the squared deviations of the observations from their mean.

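Suppose we have a population of N observations, say X1, X2, X3, …, XN, then we define the
population variance as:

$\sigma^2 = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} X_i^2}{N} - \mu^2$.

But most of the time we have a sample of n observations, say X1, X2, X3, …, Xn, from the
population of N, then we define the sample variance as:

$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}$.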

This measure of variation is universally used to show the scatter of the individual
measurements around the mean of all the measurements in a given distribution. But the
disadvantage is that the units of variance are the square of the units of the original
observations. The easiest way around this difficulty is to use the square root of the variance as a
measure of variability, called the standard deviation.


Standard deviation is the most important and widely used measure of dispersion. It was
first used by Karl Pearson in 1893. The standard deviation of a statistical data is defined as the
positive square root of the mean of the squared deviations of items from the mean of the series
under consideration.

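The population and the sample standard deviations, denoted by σ and S respectively, are
defined as:

$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}}$ and $S = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}}$.

For the case of frequency distribution data, the population and sample variances are given as:

$\sigma^2 = \frac{\sum_{i=1}^{m} f_i(X_i - \mu)^2}{N}$ and $S^2 = \frac{\sum_{i=1}^{m} f_i(X_i - \bar{X})^2}{n - 1}$, where N (or n) $= \sum f_i$,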

and the square roots of these will give the corresponding standard deviations.

Variance and Standard Deviation for Grouped Data

To obtain the variance and standard deviation of data presented in a grouped frequency
distribution, we make the same assumption that was made in calculating the mean for
grouped data: the values falling into a class are treated alike, and the observations in each
class are represented by the class mark. The calculation uses the same formula as for data
given in a frequency distribution, except that Xi is replaced by the midpoint of each class
and m by k, the number of classes.

The following steps are used to calculate the sample variance:

1. Find the arithmetic mean.

2. Find the difference between each observation and the mean.

3. Square these differences.

4. Sum the squared differences.


5. Since the data is a sample, divide the number (from step 4 above) by the number of
observations minus one, (i.e., n-1), where n is the number of observations in the data set.

Example: The areas of sprayable surfaces with DDT from a sample of 15 houses are as follows
(in m²): 101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145. Find the
variance and standard deviation of these data.

Solution: The mean of the sample is 125 m²; then

$S^2 = \frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1} = \frac{(101-125)^2 + (105-125)^2 + \dots + (145-125)^2}{15-1} = \frac{2502}{14} = 178.71\ \text{m}^4$

Hence, the standard deviation is S = (178.71 m⁴)^{1/2} = 13.37 m².

This implies that the sprayable surface area of a house deviates from the mean by about
13.37 m² on average.
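
The five steps listed earlier can be traced in a few lines of code. The following is a minimal Python sketch using the sprayable-surface data of this example; the function name `sample_variance` is only illustrative and is not part of the module.

```python
# Sample variance and standard deviation, following the five steps in the text.
areas = [101, 105, 110, 114, 115, 124, 125, 125,
         130, 133, 135, 136, 137, 140, 145]  # m^2, data from the example

def sample_variance(values):
    n = len(values)
    mean = sum(values) / n                     # step 1: arithmetic mean
    deviations = [x - mean for x in values]    # step 2: differences from the mean
    squared = [d ** 2 for d in deviations]     # step 3: square the differences
    total = sum(squared)                       # step 4: sum of squared differences
    return total / (n - 1)                     # step 5: divide by n - 1 (sample)

variance = sample_variance(areas)
std_dev = variance ** 0.5
print(round(variance, 2), round(std_dev, 2))   # 178.71 13.37
```

Dividing by n instead of n - 1 in step 5 would give the population variance instead.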

Examples: Find the variance and standard deviation of the following sample data

a) 5, 17, 12, 10.


b) The data is given in the form of grouped frequency distribution.

Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3


Solutions: a) $\bar{X}$ = (5 + 17 + 12 + 10)/4 = 11

Xi              5     10    12    17    Total
(Xi − X̄)²       36    1     1     36    74

Then S² = 74/(4 − 1) = 24.67 and S = (24.67)^{1/2} = 4.97

b) $\bar{X}$ = 4125/75 = 55

mi (midpoint)     42     47    52    57    62    67    72    Total
fi(mi − X̄)²       1183   640   198   60    588   864   867   4400

Then S² = 4400/(75 − 1) = 59.46 and S = (59.46)^{1/2} = 7.71
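
The grouped-data calculation in part (b) can be checked numerically. The sketch below is an illustration in Python; it assumes, as the text does, that every observation in a class sits at the class midpoint, and the helper name `grouped_sample_variance` is hypothetical.

```python
# Variance of a grouped frequency distribution using class midpoints.
# Each tuple is (lower class limit, upper class limit, frequency).
classes = [(40, 44, 7), (45, 49, 10), (50, 54, 22), (55, 59, 15),
           (60, 64, 12), (65, 69, 6), (70, 74, 3)]

def grouped_sample_variance(classes):
    midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
    freqs = [f for _, _, f in classes]
    n = sum(freqs)                                        # total frequency
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    ss = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))
    return mean, ss / (n - 1)                             # sample variance

mean, variance = grouped_sample_variance(classes)
print(mean, round(variance, 2), round(variance ** 0.5, 2))  # 55.0 59.46 7.71
```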

Some Important Properties of Variance and Standard Deviation

1. Consider a sample X1, …, Xn, which will be referred to as the original sample. To create a
translated sample, add a constant C to each data point: let Yi = Xi + C, i = 1, …, n.
If we compute the standard deviation of the translated sample, we can show
that the following relationship holds: if Yi = Xi + C, i = 1, …, n, then Sy = Sx.

Therefore, the standard deviation of Y will be the same as the standard deviation of X.

2. What happens to the standard deviation if the units or scales being worked with are changed?
A re-scaled sample can be created: if Yi = CXi, i = 1, …, n, then Sy = |C|Sx and S²y = C²S²x.
Therefore, to find the variance and standard deviation of the Y's, compute the variance and
standard deviation of the X's and multiply them by C² and |C|, respectively.

Example: If we have a sample of temperatures in °C with a standard deviation of 1.8, what is
the standard deviation of the same sample of temperatures in °F?

Solution: Let Yi denote the °F temperature that corresponds to a °C temperature of Xi. The
transformation required to convert the data to °F is Yi = (9/5)Xi + 32, i = 1, 2, 3, …, n.
Adding 32 does not change the standard deviation, while multiplying by 9/5 scales it by 9/5.

Then the standard deviation in °F would be: Sy = (9/5)(1.8) = 3.24 °F.
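
Both properties can be verified on any small data set. The Python sketch below uses made-up temperature readings (an assumption for illustration only, not data from the module) and the standard library function `statistics.stdev`, which computes the sample standard deviation with divisor n − 1.

```python
from statistics import stdev

# Hypothetical temperature readings in degrees Celsius (illustrative only).
celsius = [20.0, 22.5, 19.0, 24.0, 21.5]

shifted = [c + 32 for c in celsius]             # translation: Y = X + C
fahrenheit = [9 / 5 * c + 32 for c in celsius]  # re-scaling plus translation

# Adding a constant leaves the standard deviation unchanged.
print(abs(stdev(shifted) - stdev(celsius)) < 1e-9)              # True
# Multiplying by 9/5 multiplies the standard deviation by 9/5.
print(abs(stdev(fahrenheit) - 9 / 5 * stdev(celsius)) < 1e-9)   # True
```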

3. On the other hand, where several standard deviations for a variable are available and we
need to compute the combined standard deviation, the pooled standard deviation (Sp) of the
entire group consisting of all k samples may be computed as:

$S_p = \sqrt{\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2 + \dots + (n_k-1)S_k^2}{(n_1-1) + (n_2-1) + \dots + (n_k-1)}}$, where ni and Si represent the number of observations and the
standard deviation of each single sample, respectively.

4. The value of S is never negative, and it is zero only when all of the data values are the
same. Values close together will yield a small SD, whereas values spread apart will yield a
larger SD. Larger values of S therefore indicate a greater amount of variation.

Example: The standard deviation of systolic blood pressure was found to be 10.6 and 15.2
mm Hg, respectively, for two groups of 12 and 15 men. What is the standard deviation of
systolic pressure of all the 27 men?

Solution: Given: Group 1: S1 = 10.6 and n1 = 12; Group 2: S2 = 15.2 and n2 = 15; then

Sp = {(11 × 10.6² + 14 × 15.2²)/(11 + 14)}^{1/2} = (4470.52/25)^{1/2} = 13.37 mm Hg.
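
As a quick check of the arithmetic, the pooled standard deviation can be computed directly. This is a minimal Python sketch; the function name `pooled_sd` is illustrative, and the weights (ni − 1) are those used in the solution above.

```python
from math import sqrt

def pooled_sd(sizes, sds):
    """Pooled standard deviation, weighting each variance by (n_i - 1)."""
    numerator = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    denominator = sum(n - 1 for n in sizes)
    return sqrt(numerator / denominator)

# Blood-pressure example: two groups of 12 and 15 men.
sp = pooled_sd([12, 15], [10.6, 15.2])
print(round(sp, 2))  # 13.37
```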

3.5.1.2. Relative Measures of Dispersion

Coefficient of Variation (CV): The coefficient of variation (CV) is defined by CV = (S/X̄) × 100%. The
coefficient of variation is most useful in comparing the variability of several different
samples, each with a different mean. This is because a higher variability is usually expected
when the mean increases, and the CV is a measure that accounts for this.

The coefficient of variation is also useful for comparing the reproducibility of different
variables. CV is a relative measure, free from the unit of measurement.

Examples: An analysis of the monthly wages paid (in Birr) to workers in two firms A and B
belonging to the same industry gives the following results. In which firm is there greater
variability in individual wages?

Value          Firm A   Firm B
Mean wage      52.5     47.5
Median wage    50.5     45.5
Variance       100      121

Solution: C.V_A = (S_A/X̄_A) × 100% = (10/52.5) × 100% = 19.05% and

C.V_B = (S_B/X̄_B) × 100% = (11/47.5) × 100% = 23.16%.

Since C.V_A < C.V_B, there is greater variability in individual wages in firm B.
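
The comparison can be reproduced with a short calculation. The Python sketch below takes the means and variances from the table; the function name `coefficient_of_variation` is an illustrative choice.

```python
from math import sqrt

def coefficient_of_variation(mean, variance):
    """CV in percent: standard deviation divided by the mean, times 100."""
    return sqrt(variance) / mean * 100

cv_a = coefficient_of_variation(52.5, 100)   # firm A
cv_b = coefficient_of_variation(47.5, 121)   # firm B
print(round(cv_a, 2), round(cv_b, 2))        # 19.05 23.16
```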

Just as it is possible to calculate combined mean of two or more groups, similarly the
combined or pooled standard deviation of two or more groups can be calculated. The
combined standard deviation of two groups is denoted by and is computed as follow:

For more than two groups (say groups)

Where is the group sample size and is the variance of the group.

Example 4.7: Two samples of size 100 and 150, respectively, have means 50 and 60 and
standard deviations 5 and 6. Find the mean and standard of the combined sample
of size 250.

Solution: Given
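
One way to work this example numerically is sketched below in Python. It assumes the commonly used combined-variance formula, in which each group contributes n_i(S_i² + d_i²) with d_i the difference between the group mean and the combined mean, and in which the standard deviations are computed with divisor n. That formula is an assumption here, not something stated in this section, and the function name `combined_mean_sd` is hypothetical.

```python
from math import sqrt

def combined_mean_sd(sizes, means, sds):
    """Combined mean and SD of several groups (assumes divisor-n variances)."""
    n = sum(sizes)
    mean = sum(ni * mi for ni, mi in zip(sizes, means)) / n
    # Each group contributes its own spread plus the squared shift of its mean.
    total = sum(ni * (si ** 2 + (mi - mean) ** 2)
                for ni, mi, si in zip(sizes, means, sds))
    return mean, sqrt(total / n)

mean, sd = combined_mean_sd([100, 150], [50, 60], [5, 6])
print(mean, round(sd, 2))
```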


Exercise: Find the shortcut formula to find standard deviation, mean deviation and combined
deviation.

Correcting incorrect values of standard deviation

In certain cases, mean and standard deviation are calculated by using one or two incorrect
values of the variable. Just as we can correct an incorrect mean, similarly, there is a procedure
of correcting an incorrect standard deviation.

Steps in calculating the corrected standard deviation

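1. Find the incorrect sum of squares of the values of the variable, ΣX², from the incorrect mean and standard deviation.

2. Find the corrected ΣX². To do so, subtract the square of the incorrect item from the incorrect
ΣX² and add the square of the correct item; thus, corrected ΣX² = incorrect ΣX² − (incorrect item)² + (correct item)².

3. Recompute the standard deviation from the corrected ΣX² and the corrected mean.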


Relationship between measures of dispersion

Quartile deviation is approximately equal to (2/3)σ

Mean deviation is approximately equal to (4/5)σ

Note that: These relationships hold if the distribution is moderately symmetrical.

Properties of standard deviation

The important mathematical properties of standard deviation are as follow:

The standard deviation of the first n natural numbers can be found from the following
formula: $\sigma = \sqrt{\frac{n^2-1}{12}}$

For example, the standard deviation of the first 5 natural numbers is given as: $\sigma = \sqrt{\frac{5^2-1}{12}} = \sqrt{2} \approx 1.41$

We can calculate the combined standard deviation for two or more groups.
If a constant amount 'c' is added to or subtracted from each item of a series, then σ remains
unaffected.
If each item of a series is multiplied or divided by a constant 'c', then σ is multiplied or
divided by |c|.
The standard deviation has the following relation to the arithmetic mean in a symmetrical,
bell-shaped (normal) distribution:
• X̄ ± σ includes 68.27% of the observations
• X̄ ± 2σ includes 95.45% of the observations
• X̄ ± 3σ includes 99.73% of the observations
In general, for any distribution and any k > 1, the interval X̄ ± kσ contains at least a
fraction 1 − 1/k² of the total number of observations.
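
The last property (Chebyshev's bound) is easy to tabulate. The following Python sketch assumes the bound in its usual form, 1 − 1/k² for k > 1; the function name `chebyshev_lower_bound` is illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is informative only for k > 1")
    return 1 - 1 / k ** 2

for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 4))
# 1.5 0.5556
# 2 0.75
# 3 0.8889
```

The bound holds for any distribution, which is why it is weaker than the 95.45% and 99.73% figures quoted for the bell-shaped case.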


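Example 4.8: If the value of the standard deviation in a moderately symmetrical distribution is 24,
find the value of the mean deviation and the quartile deviation.

Solution: For a moderately symmetrical series, the mean deviation is approximately (4/5)σ = (4/5)(24) = 19.2 and the quartile deviation is approximately (2/3)σ = (2/3)(24) = 16.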

Example 4.9: If the mean and the standard deviation of 25 boys’ weight are 50 and 5,
respectively, at least how many boys will in the interval ?

Solution:

Therefore, at least or 50% of the boys have body weight with in

the interval

Sheppard’s Correction for Variance

When the observations are grouped into classes, all observations in a class are treated as equal to the
midpoint of the class. This introduces some error, known as grouping error. Sheppard suggested
a correction, known as Sheppard's correction, in which $h^2/12$ is subtracted from the variance computed from the grouped data, where h is the class width.

Merits of Standard Deviation

It is simple to understand
It is well-defined
It is based on all items
It is suitable for further algebraic treatment
It has sampling stability
It is very useful in the study of "Tests of Significance"

Demerits of Standard Deviation

It is not easy to calculate
It is unduly affected by extreme values

The standard Score (Z-score)

A standard score for a sample value in a data set is obtained by subtracting the mean of the data set
from the value and dividing the result by the standard deviation of the data set. Basically, the Z-score is
the number of standard deviations that a given value X lies below or above the mean, and it is
defined as

$Z = \frac{X-\bar{X}}{S}$ (for sample data sets) and $Z = \frac{X-\mu}{\sigma}$ (for population data sets).

Values above the mean have positive Z-scores and values below the mean have negative
Z-scores. The numerical value of the Z-score reflects the relative standing of the value; because of this, the Z-score is also referred to
as a measure of relative standing. Scores are generally meaningless by themselves
unless they are compared to the distribution of scores from some reference group. In addition
to comparing data sets, the Z-score is useful for transforming a given data set into a new distribution
with mean zero and variance one, the same mean and variance as the standard normal
distribution (we will see it in the chapters on hypothesis testing).

Note: A Z-score less than -2 or greater than 2 is considered an unusual value, while a Z-score
between -2 and 2 is considered an ordinary value.

Properties of the Z-score

• The sum of Z-scores is always zero.
• The mean of Z-scores is zero.
• The variance and standard deviation of Z-scores are equal to one.

Examples: 1. Two sections were given an introduction to statistics examination. The following
information was given.

Value                 Section 1   Section 2
Mean                  78          90
Standard deviation    6           5

Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively
speaking, who performed better?

Solution: ZA = (XA − X̄1)/S1 = (90 − 78)/6 = 2 and ZB = (XB − X̄2)/S2 = (95 − 90)/5 = 1.

Student A performed better relative to his section because the score of student A is two
standard deviations above the mean score of his section, while the score of student B is only
one standard deviation above the mean score of his section.
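
The comparison above takes only a few lines to compute. This is a minimal Python sketch using the figures from the example; the function name `z_score` is illustrative.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

z_a = z_score(90, 78, 6)   # student A, section 1
z_b = z_score(95, 90, 5)   # student B, section 2
print(z_a, z_b)            # 2.0 1.0
print("A" if z_a > z_b else "B", "performed better relative to the section")
```

For a race time, where smaller values are better, the more negative Z-score indicates the better relative performance.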

2. Two groups of people were trained for a 100km race and tested to find out which group is faster to
complete the race. For the two groups the following information was given:

Value        Group one   Group two
Mean         10.4 min    11.9 min
Stan. dev.   1.2 min     1.3 min

Relatively speaking:

a) Which group is more consistent in its performance?

b) Suppose person A from group one takes 9.2 minutes while person B from group two takes 9.3
minutes. Who was faster in completing the race? Why? (Do it yourself.)

3.6. Moments, Skewness and Kurtosis

In describing a numerical data set it is not only necessary to summarize the data by presenting
appropriate measures of central tendency, dispersion and relative standing, it is also necessary
to consider the shape of the data – the manner, in which the data are distributed. There are two
measures of the shape of a data set: skewness and kurtosis.

Moments

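Moments are statistical measures used to describe the characteristics of a distribution. We
can take moments about any number A and/or about the mean (called central moments).

The rth moment of the distribution about the mean is:

$M_r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})^r}{n}$ for an ungrouped data set and

$M_r = \frac{\sum_{i=1}^{k} f_i (X_i-\bar{X})^r}{n}$ for a grouped data set.

The rth moment of the distribution about A is:

$M'_r = \frac{\sum_{i=1}^{n}(X_i-A)^r}{n}$ for an ungrouped data set and

$M'_r = \frac{\sum_{i=1}^{k} f_i (X_i-A)^r}{n}$ for a grouped data set,

where, for grouped data, Xi is the class mark and fi the frequency of the ith class, and n = Σfi.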

Skewness

If the distribution of the data is not symmetrical, it is called asymmetrical or skewed.


Skewness characterizes the degree of asymmetry of a distribution around its mean.

The direction of the skewness depends upon the location of the extreme values. If the extreme
values are the larger observations, the mean will be the measure of location most greatly
distorted toward the upward direction. Since the mean exceeds the median and the mode, such
distribution is said to be positive or right-skewed. The tail of its distribution is extended to the
right.

On the other hand, if the extreme values are the smaller observations, the mean will be the
measure of location most greatly reduced. Since the mean is exceeded by the median and the
mode, such distribution is said to be negative or left-skewed. The tail of its distribution is
extended to the left.

[Figure: right-skewed distribution and left-skewed distribution]

For a sample data, the skewness is defined by the formula:

Sk= , where n = no of observations in the sample &s = SD of the

sample.

It is also possible to find skewness as: $SK = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}$

Properties of Skewness

• If SK = 0, then the distribution is symmetrical.
• If SK > 0, then the distribution is positively skewed.
• If SK < 0, then the distribution is negatively skewed.
• There is no theoretical limit to this measure; however, in practice the value given by this
formula falls between -3 and 3.

Kurtosis

Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the
bell-shaped distribution (normal distribution) or kurtosis is the degree of measure of
peakedness of a distribution.

If a distribution is more peaked than a normal distribution, it is called a leptokurtic
distribution; if it is flatter, it is called platykurtic; and if it is moderate (normal), we call it
mesokurtic.

Kurtosis of a sample data set is calculated directly from the data by the formula:

= -

It is also possible to calculate the measure of kurtosis from the moments about the mean of
the sample data as: $\beta_2 = \frac{M_4}{M_2^2}$, where $M_4$ is the 4th moment about the mean and $M_2$ is the 2nd moment about the mean (the variance).

Interpretation of the value of β2

1. If β2 = 3, then the distribution is mesokurtic.
2. If β2 > 3, then the distribution is leptokurtic.
3. If β2 < 3, then the distribution is platykurtic.

If we want our reference point to be zero, we can change the above coefficient to: φ = β2 − 3.

Accordingly, if φ = 0, then the distribution is said to be mesokurtic.

If φ > 0, then the distribution is said to be leptokurtic.

If φ < 0, then the distribution is said to be platykurtic.
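
The moment-based measures can be illustrated with a small calculation. The Python sketch below computes central moments with divisor n and then the kurtosis coefficient as the fourth central moment divided by the square of the second. The moment coefficient of skewness, M3/M2^{3/2}, is included as a common companion measure; it is an assumption of this sketch and is not a formula given in this section. The data are made up for illustration.

```python
def central_moment(values, r):
    """r-th moment about the mean, with divisor n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n

# Hypothetical data with a long right tail (illustrative only).
data = [2, 4, 4, 5, 7, 9, 18]

m2 = central_moment(data, 2)
m3 = central_moment(data, 3)
m4 = central_moment(data, 4)

skewness = m3 / m2 ** 1.5    # assumed moment coefficient of skewness
beta2 = m4 / m2 ** 2         # kurtosis coefficient, compared with 3
phi = beta2 - 3              # same measure with zero as the reference point

print(skewness > 0)          # True: the tail extends to the right
print("leptokurtic" if phi > 0 else "platykurtic" if phi < 0 else "mesokurtic")
```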

[Figure: The distributions with positive and negative kurtosis]

3.7. Statistical Measures of Association

Suppose we have a population consisting of observations having two attributes or qualitative
characteristics, say A and B.

- If the attributes are independent, then the probability of possessing both A and B is PA*PB,

where PA is the probability that a member has attribute A and

PB is the probability that a member has attribute B.

- Suppose A has r mutually exclusive and exhaustive classes and

B has c mutually exclusive and exhaustive classes.

- The entire set of data can be represented using an r * c contingency table:

                          B
A        B1     B2     ...    Bj     ...    Bc     Total
A1       O11    O12    ...    O1j    ...    O1c    R1
A2       O21    O22    ...    O2j    ...    O2c    R2
...
Ai       Oi1    Oi2    ...    Oij    ...    Oic    Ri
...
Ar       Or1    Or2    ...    Orj    ...    Orc    Rr
Total    C1     C2     ...    Cj     ...    Cc     n


The chi-square test is used to test the hypothesis of independence of two attributes.
For instance, we may be interested in

• Whether the presence or absence of hypertension is independent of smoking habit or
not.
• Whether the size of the family is independent of the level of education attained by
the mothers.
• Whether there is association between demand and supply regarding products.
• Whether there is association between stability of marriage and period of
acquaintanceship prior to marriage.

- The χ² statistic is given by:

$\chi^2_{cal} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-e_{ij})^2}{e_{ij}} \sim \chi^2_{(r-1)(c-1)}$

where Oij = the number of units that belong to category i of A and j of B,

eij = the expected frequency that belongs to category i of A and j of B.

The eij is given by:

$e_{ij} = \frac{R_i \times C_j}{n}$

where Ri = the ith row total,
Cj = the jth column total,
n = total number of observations.


Remark:

$n = \sum_{i=1}^{r}\sum_{j=1}^{c} O_{ij} = \sum_{i=1}^{r}\sum_{j=1}^{c} e_{ij}$

- The null and alternative hypotheses may be stated as:

H0: There is no association between A and B.
H1: not H0 (there is association between A and B).

Decision Rule:

- Reject H0 (independence) at the α level of significance if the calculated value of
χ² exceeds the tabulated value with degrees of freedom equal to (r − 1)(c − 1); that is,

reject H0 if $\chi^2_{cal} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij}-e_{ij})^2}{e_{ij}} > \chi^2_{\alpha}\big((r-1)(c-1)\big)$.

Examples:

1. A geneticist took a random sample of 300 men to study whether there is association between
father and son regarding boldness. He obtained the following results.

               Son
Father     Bold    Not
Bold       85      59
Not        65      91

Using α = 5%, test whether there is association between father and son regarding boldness.

Solution:

H0: There is no association between father and son regarding boldness.
H1: not H0

- First calculate the row and column totals:

R1 = 144, R2 = 156, C1 = 150, C2 = 150

- Then calculate the expected frequencies (eij's), using $e_{ij} = \frac{R_i \times C_j}{n}$:

e11 = (R1 × C1)/n = (144 × 150)/300 = 72
e12 = (R1 × C2)/n = (144 × 150)/300 = 72
e21 = (R2 × C1)/n = (156 × 150)/300 = 78
e22 = (R2 × C2)/n = (156 × 150)/300 = 78

- Obtain the calculated value of the chi-square:

$\chi^2_{cal} = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij}-e_{ij})^2}{e_{ij}} = \frac{(85-72)^2}{72} + \frac{(59-72)^2}{72} + \frac{(65-78)^2}{78} + \frac{(91-78)^2}{78} = 9.028$

- Obtain the tabulated value of chi-square:

α = 0.05
Degrees of freedom = (r − 1)(c − 1) = 1 × 1 = 1
χ²0.05(1) = 3.841 from the table.

- The decision is to reject H0, since χ²cal > χ²0.05(1).

Conclusion: At the 5% level of significance we have evidence to say there is association
between father and son regarding boldness, based on this sample data.
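
The calculation of expected frequencies and the chi-square statistic can be sketched in plain Python as follows. The function name `chi_square_statistic` is illustrative, and the critical value 3.841 is the table value quoted in the solution, not something the code looks up.

```python
def chi_square_statistic(table):
    """Chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # e_ij = R_i * C_j / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Father (rows) by son (columns) table from the example.
chi2, df = chi_square_statistic([[85, 59], [65, 91]])
print(round(chi2, 3), df)        # 9.028 1
print(chi2 > 3.841)              # True: reject H0 at the 5% level
```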

2. A random sample of 200 men, all retired, was classified according to education and number of
children, as shown below.

Education level           Number of children
                          0-1     2-3     Over 3
Elementary                14      37      32
Secondary and above       31      59      27

Test the hypothesis that the size of the family is independent of the level of education attained
by fathers. (Use 5% level of significance.)

Solution:

H0: There is no association between the size of the family and the level of
education attained by fathers.
H1: not H0.

- First calculate the row and column totals:

R1 = 83, R2 = 117, C1 = 45, C2 = 96, C3 = 59

- Then calculate the expected frequencies (eij's), using $e_{ij} = \frac{R_i \times C_j}{n}$:

e11 = 18.675, e12 = 39.84, e13 = 24.485
e21 = 26.325, e22 = 56.16, e23 = 34.515

- Obtain the calculated value of the chi-square:

$\chi^2_{cal} = \sum_{i=1}^{2}\sum_{j=1}^{3}\frac{(O_{ij}-e_{ij})^2}{e_{ij}} = \frac{(14-18.675)^2}{18.675} + \frac{(37-39.84)^2}{39.84} + \dots + \frac{(27-34.515)^2}{34.515} = 6.3$

- Obtain the tabulated value of chi-square:

α = 0.05
Degrees of freedom = (r − 1)(c − 1) = 1 × 2 = 2
χ²0.05(2) = 5.99 from the table.

- The decision is to reject H0, since χ²cal > χ²0.05(2).

Conclusion: At the 5% level of significance we have evidence to say there is association
between the size of the family and the level of education attained by fathers, based on this
sample data.

EXERCISES:

Part 1: choose the best answer

1. Which of the following measures is most influenced by an outlier?
A. Median B. Mean C. Mode D. Quartile
2. The sum of the deviations of individual values from their arithmetic mean is always
A. Positive B. Zero C. Negative D. Indeterminate
3. The geometric mean of the two values 1/16 and 4/25 is equal to
A. 1/100 B. 100 C. 1/10 D. 10 E. None
4. One of the following is not computed for an open-ended class interval
A. Median B. Mean C. Mode D. Quartile
5. One of the following measures of dispersion is not used to compare the degree of variability
among different series. Which one?
A. Standard deviation B. Coefficient of range C. Coefficient of variation D. Mean
deviation
6. Which one of the following is not a feature of the standard deviation?
A. It is based on squared deviations from the arithmetic mean
B. It is expressed in the same units as the mean
C. Its calculation employs all of the values in the series
D. It is not a relative measure
E. None
7. The right expression for variance is
A. Variance is the square of the standard deviation
B. Variance is the product of the arithmetic mean and the coefficient of variation
C. The positive square root of the standard deviation is also the variance
D. A and C
8. Suppose that the frequencies are highly concentrated or spread out to the right of the center in a
given distribution. Which of the following expressions best describes the case?
A. The distribution is negatively skewed B. Mean = Mode = Median C. None
9. For a positively skewed distribution, the extreme values lie in the:
A. Anywhere B. Right tail C. Left tail D. The middle

Part II: Answer the following questions

1. The accompanying data describe the hourly wage rates (dollars per hour) for 30 employees
of an electronics firm:
22.66 24.39 17.31 21.02 21.61 20.97 18.58 16.61
19.74 21.57 20.56 22.16 20.16 18.97 22.64 19.62
22.05 22.03 17.09 24.60 23.82 17.80 16.28 19.34
22.22 19.49 22.27 18.20 19.29 20.43
Construct a frequency distribution, calculate all of the measures of central tendencies and
measures of dispersions.
2. For 75 employees of a large department store, the following distribution for years of service
was obtained. Calculate the following
Class limits Frequency
1–5 21
6–10 25
11–15 15
16–20 0
21–25 8
26–30 6
a. Mean, median, mode of the data
b. Variance, standard deviation, coefficient variation
c. Q2, D5, P50

3. The monthly income of sample employees of Jimma University is given below.

No. of employees          12    8    17    6    2    23    13    9    5    14
Income (in "000" ETB)     1.2   4    1.5   3    13   7     9     6    6    10
a. Determine the number of employees involved in the survey
b. Calculate the median income for the employees
c. Calculate the first, second and third quartiles using the data
d. Calculate the standard deviation, mean deviation and coefficient of mean deviation

4. The salaries (in millions of dollars) for 31 NFL teams for a specific season are given in this
frequency distribution.
Class limits Frequency
39.9–42.8 2
42.9–45.8 2
45.9–48.8 5
48.9–51.8 5
51.9–54.8 12
54.9–57.8 5
a. Mean, median, mode of the data
b. Inter quartile range (IQR)
c. Variance, standard deviation, coefficient variation
d. Q2, D5, P50
e. Kurtosis and skewness
f. What can you conclude about the shape of the data?


5. An insurance company researcher conducted a survey on the number of car thefts in a large
city for a period of 30 days last summer. The raw data are shown. Construct a stem and leaf
plot by using classes 50–54, 55–59, 60–64, 65–69, 70–74, and 75–79.
52 62 51 50 69
58 77 66 53 57
75 56 55 67 73
79 59 68 65 72
57 51 63 69 75
65 53 78 66 55
a. Mean, median, mode of the data
b. Quartile deviation and Inter quartile range (IQR)
c. Variance, standard deviation
d. Q3, D9, P75
e. Kurtosis and skewness
f. What can you conclude about the shape of the data?
6. Find the missing information from the following data.

                      Group 1   Group 2   All groups
Mean                  55        70        60
Sample size           100       ?         150
Standard deviation    15        10        ?

7. A random sample of 10 boys is selected from the population of a certain camp, and each
boy's weight and height are measured and recorded. The average weight of the boys in the
sample is 32.66 kg with a standard deviation of 3.9 kg, and the average height is 95.5 cm with a
standard deviation of 5.2 cm. Which measurement, weight or height, is less variable?
8. Assume that the students in this class have pulse rates with a mean of 72.9 and a standard deviation
of 12.3. While you are doing this question, your pulse rate is 48. Calculate your Z-score. What
do you conclude from it about yourself?

9. Consider the following grouped frequency distribution and find: (i) the variance, (ii) the
coefficient of variation, (iii) the coefficient of skewness, (iv) the coefficient of kurtosis.

Class (cm)    11-13   14-16   17-19   20-22   23-25
Frequency     11      20      30      15      4

10. The cost of consumer purchases such as single-family housing, gasoline, Internet services, tax
preparation, and hospitalization were provided in The Wall-Street Journal (January 2, 2007).
Sample data typical of the cost of tax-return preparation by services such as H&R
Block are shown below.
120 230 110 115 160
130 150 105 195 155
105 360 120 120 140
100 115 180 235 255
a. Compute the mean, median, and mode.
b. Compute the first and third quartiles.
c. Compute and interpret the 90th percentile?


CHAPTER FOUR: PROBABILITY AND PROBABILITY DISTRIBUTION

The present lesson is an attempt to overview the concept of probability, thereby enabling the
students to appreciate the relevance of probability theory in decision-making under conditions
of uncertainty. After successful completion of the lesson the students will be able to
understand and use the different approaches to probability as well as different probability
rules for calculating probabilities in different situations.

The overall objective of this lesson is to discuss the concept of probability, counting
techniques, approaches to probability, random variable and probability distributions. After
successful completion of the lesson the students will be able to appreciate the usefulness of
Probability, counting techniques, probability distributions in decision-making and also
identify situations where Binomial, Poisson, exponential, and Normal probability distributions
can be applied.

4.1. Basic definitions of probability

Life is full of uncertainties. 'Probably', 'likely', 'possibly', 'chance', etc. are some of the most
commonly used terms in our day-to-day conversation. All these terms more or less convey the
same sense - "the situation under consideration is uncertain and commenting on the future
with certainty is impossible". Decision-making in such areas is facilitated through formal and
precise expressions for the uncertainties involved. For example, product demand is uncertain,
but a study of demand spelled out in a form amenable to analysis may go a long way to help
analyze and facilitate decisions on sales planning and inventory management. Intuitively, we
see that if there is a high chance of high demand in the coming year, we may decide to stock
more. We may also take some decisions regarding price increases, reducing sales expenses,
etc. to manage the demand. However, in order to make such decisions, we need to quantify
the chances of different quantities of demand in the coming year. Probability theory provides
us with the ways and means to quantify the uncertainties involved in such situations.

A probability is a quantitative measure of uncertainty - a number that conveys the strength
of our belief in the occurrence of an uncertain event.


Since uncertainty is an integral part of human life, people have always been interested -
consciously or unconsciously - in evaluating probabilities. Having its origin associated with
gamblers, the theory of probability today is an indispensable tool in the analysis of situations
involving uncertainty. It forms the basis for inferential statistics as well as for other fields that
require quantitative assessments of chance occurrences, such as quality control, management
decision analysis, and almost all areas in physics, biology, engineering and economics or
social life.

In general

• Probability theory is the foundation upon which the logic of inference is built.
• It helps us to cope with uncertainty.
• Probability is the chance of an outcome of an experiment. It is the measure of how likely an
outcome is to occur.

4.2. Fundamental concepts: experiment and event, event and their relationships, conditional
and joint probability

Fundamental concepts and Definitions of some probability terms

Probability, in common parlance, refers to the chance of occurrence of an event or happening.
In order to compute it, a proper understanding of certain basic concepts in
probability theory is required. These concepts are an experiment, a sample space, and an
event.

1. Experiment: Any process of observation or measurement, or any process which
generates well-defined outcomes. An outcome of an experiment is some observation or
measurement. The term experiment is used in probability theory in a much broader sense than
in physics or chemistry.

Any action, whether it is the drawing of a card out of a deck of 52 cards, or reading the
temperature, or measuring a product's dimension to ascertain quality, or the launching of
a new product in the market, constitutes an experiment in the terminology of probability theory.


The experiments in probability theory have three things in common:

• There are two or more outcomes of each experiment.
• It is possible to specify the outcomes in advance.
• There is uncertainty about the outcomes.

For example, the product we are measuring may turn out to be undersize or right size or
oversize, and we are not certain which way it will be when we measure it. Similarly,
launching a new product involves uncertain outcome of meeting with a success or failure in
the market. A single outcome of an experiment is called a basic outcome or an elementary
event. Any particular card drawn from a deck is a basic outcome.

2. Outcome: The result of a single trial of a random experiment.

Example

Experiment                                  Outcomes
Tossing of a fair coin                      Head, tail
Rolling a die                               1, 2, 3, 4, 5, 6
Selecting an item from a production lot     Good, bad
Introducing a new product                   Success, failure

3. Sample Space: The set of all possible outcomes of a probability experiment. The
sample space is the universal set S pertinent to a given experiment. Each outcome is
visualized as a sample point in the sample space.

Example

Experiment                                Sample Space
Drawing a card                            {all 52 cards in the deck}
Reading the temperature                   {all numbers in the range of temperatures}
Measurement of a product's dimension      {undersize, oversize, right size}
Launching of a new product                {success, failure}

4. Event: An event, in probability theory, constitutes one or more possible outcomes
of an experiment; that is, it is a subset of the sample space (a set of basic outcomes). We say
that the event occurs if the experiment gives rise to a basic outcome belonging to the event.
An event can also be viewed as a statement about one or more outcomes of a random
experiment. Events are denoted by capital letters.

Example 1: Consider the experiment of rolling a die. Let A be the event of odd numbers,
B be the event of even numbers, and C be the event of number 8.

A = {1, 3, 5}
B = {2, 4, 6}
C = { } = ∅, the empty set or impossible event

Remark: If S (the sample space) has n members, then there are exactly 2^n subsets or events.

Example 2: For the experiment of drawing a card, we may obtain different events A, B, and
C like:

A: The event that card drawn is king of club

B: The event that card drawn is red

C: The event that card drawn is ace

In the first case, out of the 52 sample points that constitute the sample space, only one sample
point or outcome defines the event, whereas the number of outcomes in the second and
third cases is 26 (the hearts and diamonds) and 4, respectively.
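
Treating events as subsets of the sample space makes such counts easy to verify. The Python sketch below builds a standard 52-card deck (13 ranks in each of 4 suits, with hearts and diamonds as the red suits, so there are 26 red cards) and counts the sample points in each event; the variable names are illustrative.

```python
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["clubs", "diamonds", "hearts", "spades"]

# Sample space: every (rank, suit) pair is one sample point.
deck = set(product(ranks, suits))

# Events are subsets of the sample space.
A = {card for card in deck if card == ("K", "clubs")}            # king of clubs
B = {card for card in deck if card[1] in ("diamonds", "hearts")}  # red card
C = {card for card in deck if card[0] == "A"}                     # ace

print(len(deck), len(A), len(B), len(C))   # 52 1 26 4
```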

5. Equally Likely Events: Events which have the same chance of occurring.
6. Elementary Event: an event having only a single element or sample point.

138
Basic Statistics

7. Composite (compound) Event: an event having two or more elementary events in
it. For example, in rolling a die the sample space is {1, 2, 3, 4, 5, 6}; the event {5} is a simple
event, whereas the event of an even number, {2, 4, 6}, is a compound (composite) event.
8. Complement of an Event: the complement of an event A means non-occurrence
of A and is denoted by A', Ac, or Ā. Consider event A defined over the sample space S. The
complement of A is the subset of S which contains all outcomes that do not belong to A.
The Rule of Complements defines the probability of the complement of an event in terms of
the probability of the original event.

In other words, A ∪ A' = S

So P(A ∪ A') = P(S)

or, since A and A' are mutually exclusive, P(A) + P(A') = 1

or P(A') = 1 - P(A)

As a simple example, if the probability of rain tomorrow is 0.3, then the probability of no rain
tomorrow must be 1 - 0.3 = 0.7.

If the probability of drawing a king is 4/52, then the probability of the drawn card's not being
a King is 1 - 4/52 = 48/52.

9. Elementary Event: an event having only a single element or sample point.


10. Mutually Exclusive Events: Two events are said to be mutually exclusive if both
cannot occur at the same time as the outcome of a single experiment. In other words, two
events E1 and E2 are mutually exclusive if there is no sample point common to both E1 and
E2. For example, if we roll a fair die, then the experiment is rolling the die and the sample
space (S) is

S = {1, 2, 3, 4, 5, 6}

Suppose we are interested in the event E1 of getting an even number and the event E2 of
getting an odd number:

E1 = {2, 4, 6}, E2 = {1, 3, 5}

Clearly E1 ∩ E2 = ∅. Thus E1 and E2 are mutually exclusive events.

11. Independent Events: Two events A and B are independent if the occurrence of one
does not affect the probability of the other occurring. For example, if two fair coins are
tossed, then the result of one toss is totally independent of the result of the other toss. The
probability that a head will be the outcome of any one toss will always be ½, irrespective of
the outcome of the other toss. Hence, these two events are independent. On the other hand,
consider drawing two cards, without replacement, from a pack of 52 playing cards. The
probability that the second card will be an ace depends upon whether the first card was an
ace or not. Hence these two events are not independent events.

As another example, a bag contains balls of two different colours, say yellow and white. Two
balls are drawn successively. The first ball is drawn from the bag and replaced after its colour
is noted. Let us assume that it is yellow and denote this event by A. Another ball is drawn
from the same bag and its colour is noted; let this event be denoted by B. Clearly, the result
of the first draw has no effect on the result of the second draw. Hence, the events A and B are
independent events.

12. Dependent Events: Two events are dependent if the occurrence of the first event
changes the probability of occurrence of the second event.
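
The card example above can be made concrete with exact fractions. This is a small illustrative Python sketch (the variable names are assumptions introduced here, not notation from the module). It compares the chance that the second card is an ace under the different outcomes of the first draw.

```python
from fractions import Fraction

ACES, DECK = 4, 52

# First draw: nothing is known yet.
p_first_ace = Fraction(ACES, DECK)

# Second draw WITHOUT replacement: only 51 cards remain, and the number of
# aces left depends on what the first card was (dependent events).
p_second_ace_given_first_ace = Fraction(ACES - 1, DECK - 1)
p_second_ace_given_first_not_ace = Fraction(ACES, DECK - 1)

# Second draw WITH replacement: the deck is restored, so the first draw
# has no influence (independent events).
p_second_ace_with_replacement = Fraction(ACES, DECK)

print(p_first_ace)
print(p_second_ace_given_first_ace, p_second_ace_given_first_not_ace)
print(p_second_ace_with_replacement)
```

The two conditional values differ from each other, which is exactly what dependence means; with replacement the value equals the unconditional probability.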

Example: What is the sample space for each of the following experiments?

a) Toss a die one time.
b) Toss a coin two times.
c) A light bulb is manufactured. It is tested for its life length by time.

Solution

a) S = {1, 2, 3, 4, 5, 6}
b) S = {HH, HT, TH, TT}
c) S = {t : t ≥ 0}

A sample space can be
- Countable (finite or infinite)
- Uncountable

4.3. Review of set theory

Definition 4.3:

A set is a collection of well-defined objects. These objects are called elements. Sets are
usually denoted by capital letters and elements by small letters. Membership in a given set is
denoted by ∈ (belongs to the set) and ∉ (does not belong to the set).

Description of sets: Sets can be described in any of the following three ways: the
complete listing method (all elements of the set are listed), the partial listing method (the
elements of the set are indicated by listing some of them) and the set
builder method (using an open proposition to describe the elements that belong to the set).

Example 1.2: The possible outcomes in tossing a six-sided die

S = {1, 2, 3, 4, 5, 6} or S = {1, 2, . . ., 6} or S = {x: x is an outcome in tossing a six-sided die}

Types of set

Universal set: is a set that contains all elements under consideration in a particular
discussion.

Empty or null set: is a set which has no element, denoted by {} or ∅.

Finite set: is a set which contains a finite number of elements (e.g. {x: x is an integer, 0 < x < 5}).

Infinite set: is a set which contains an infinite number of elements (e.g. {x: x is an integer, x > 0}).


Subset: If every element of set A is also an element of set B, set A is called a subset of B,
denoted by A ⊆ B.

Proper subset: For two sets A and B, if A is a subset of B and B is not a subset of A, then A is
said to be a proper subset of B, denoted by A ⊂ B.

Equal sets: Two sets A and B are said to be equal if every element of A is an element of B and
every element of B is an element of A.

Equivalent sets: Two sets A and B are said to be equivalent if there is a one-to-one
correspondence between the elements of the two sets.

Set Operation and their Properties

There are many ways of operating on two or more sets to get another set. Some of them are
discussed below.

Union of sets: The union of two sets A and B is the set which contains the elements that belong
to at least one of the two sets. It is denoted by A ∪ B (A union B).

Intersection of sets: The intersection of two sets A and B is the set which contains the elements
that belong to both A and B. It is denoted by A ∩ B (A intersection B).

Disjoint sets: are two sets whose intersection is the empty set.

Absolute complement or complement: Let U be the universal set and A be a subset of U;
then the complement of A, denoted by A', is the set which contains the elements of U that do
not belong to A.

Relative complement (or difference): The difference of set A with respect to set B,
written as A\B (or A – B), is the set which contains the elements of A that do not belong to B.

Symmetric difference: The symmetric difference of two sets A and B, denoted by A Δ B, is the
set which contains the elements that belong to A but not to B together with the elements that
belong to B but not to A. That is, A Δ B = (A\B) ∪ (B\A).
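
The operations defined above map directly onto the built-in set type of Python. The sketch below is illustrative only; the particular sets U, A and B are assumptions chosen for the demonstration.

```python
U = {1, 2, 3, 4, 5, 6}   # universal set: outcomes of one die roll
A = {1, 3, 5}            # odd outcomes
B = {4, 5, 6}            # outcomes greater than 3

union = A | B                  # elements in A or B (or both)
intersection = A & B           # elements in both A and B
complement_A = U - A           # elements of U not in A
difference = A - B             # elements of A not in B (A \ B)
symmetric_difference = A ^ B   # elements in exactly one of A and B

print(sorted(union), sorted(intersection), sorted(complement_A))
print(sorted(difference), sorted(symmetric_difference))
```

The symmetric difference computed with the ^ operator agrees with the definition (A\B) ∪ (B\A).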

Basic Properties of the Set Operations

Let U be the universal set and let A, B, C be sets in the universe. The following properties
hold true.

1. A ∪ B = B ∪ A (Union of sets is commutative)
2. A ∪ (B ∪ C) = (A ∪ B) ∪ C = A ∪ B ∪ C (Union of sets is associative)
3. A ∩ B = B ∩ A (Intersection of sets is commutative)
4. A ∩ (B ∩ C) = (A ∩ B) ∩ C = A ∩ B ∩ C (Intersection of sets is associative)
5. A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) (Union of sets is distributive over intersection)
6. A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) (Intersection of sets is distributive over union)
7. A – B = A \ B = A ∩ B'
8. If A ⊂ B, then B' ⊂ A'; and if A ⊆ B, then B' ⊆ A'
9. A ∪ ∅ = A and A ∩ ∅ = ∅
10. A ∪ U = U and A ∩ U = A
11. (A ∪ B)' = A' ∩ B' (De Morgan's first rule)
12. (A ∩ B)' = A' ∪ B' (De Morgan's second rule)
13. A = (A ∩ B) ∪ (A ∩ B')
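
Several of these properties can be verified exhaustively on a small universe. The following Python sketch is an illustration under stated assumptions: the universe {1, 2, 3, 4} and the helper names subsets, complement and identities_hold are hypothetical choices made here. It checks the distributive laws, De Morgan's rules and properties 7 and 13 for every triple of subsets.

```python
from itertools import combinations

U = frozenset({1, 2, 3, 4})

def subsets(universe):
    """All subsets of a finite universe."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def complement(s):
    """Absolute complement relative to U."""
    return U - s

def identities_hold(A, B, C):
    return all([
        A | (B & C) == (A | B) & (A | C),                     # property 5
        A & (B | C) == (A & B) | (A & C),                     # property 6
        A - B == A & complement(B),                           # property 7
        complement(A | B) == complement(A) & complement(B),   # property 11
        complement(A & B) == complement(A) | complement(B),   # property 12
        A == (A & B) | (A & complement(B)),                   # property 13
    ])

all_subsets = subsets(U)
all_ok = all(identities_hold(A, B, C)
             for A in all_subsets for B in all_subsets for C in all_subsets)
print(len(all_subsets), all_ok)
```

A check on one small universe is a sanity test, not a proof; the properties themselves follow from the definitions of union, intersection and complement.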

Corresponding statement in set theory and probability

Set theory            Probability theory
Universal set, U      Sample space S, sure event
Empty set ∅           Impossible event
Elements a, b, …      Sample points a, b, c, … (or simple events)
Sets A, B, C, …       Events A, B, C, …
A                     Event A occurs
A'                    Event A does not occur
A ∪ B                 At least one of events A and B occurs
A ∩ B                 Both events A and B occur
A ⊆ B                 The occurrence of A necessarily implies the occurrence of B
A ∩ B = ∅             A and B are mutually exclusive (that is, they cannot occur simultaneously)

In many problems of probability, we are interested in events that are actually combinations of
two or more events formed by unions, intersections, and complements. Since the concept of
set theory is of vital importance in probability theory, we need a brief review.

The union of two sets A and B, A ∪ B, is the set with all elements in A or B or both.

The intersection of A and B, A ∩ B, is the set that contains all elements in both A and B.

The complement of A, Ac, is the set that contains all elements in the universal set that are
not found in A. Some similarities between notions in set theory and that of probability theory
are:

In probability theory                           In set theory
i. Event A or event B                           A ∪ B
ii. Event A and event B                         A ∩ B
iii. Event A is impossible                      A = ∅
iv. Event A is certain                          A = S
v. Events A and B are mutually exclusive        A ∩ B = ∅

Again, using a Venn diagram, one can easily verify the following relationships:

1. A ∪ B = (A ∩ B') ∪ (A ∩ B) ∪ (B ∩ A'), noting that the three are mutually exclusive;
A ∩ B' = A – B and B ∩ A' = B – A.

2. A ∪ B = A ∪ (B ∩ A'), again mutually exclusive.

3. A = (A ∩ B') ∪ (A ∩ B) and B = (B ∩ A') ∪ (A ∩ B).


Finite, infinite sample space and equally likely outcomes

If a sample space has a finite number of points, it is called a finite sample space. If it has as
many points as there are natural numbers 1, 2, 3, …, it is called a countably infinite sample
space. If it has as many points as there are in some interval, such as 0 < x < 1, it is called a
non-countably infinite sample space. A sample space which is finite or countably infinite is
often called a discrete sample space, while one which is non-countably infinite is called a
continuous sample space.

Equally Likely Outcomes

Equally likely outcomes are outcomes of an experiment which have an equal chance (are
equally probable) of appearing. The outcomes of a finite sample space are often assumed to
be equally likely. If we have n equally likely outcomes in the sample space, then the
probability of the i-th sample point xi is P(xi) = 1/n, where xi can be the first, second, ..., or
the n-th outcome.

Example: In an experiment of tossing a fair die, the outcomes are equally likely (each outcome
is equally probable). Hence, P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6.

4.4. Counting Rules

In order to calculate probabilities, we have to know

- The number of elements of an event
- The number of elements of the sample space.

That is, in order to judge what is probable, we have to know what is possible. If there are n
events and event i can occur in Ni possible ways, then the number of ways in which the
sequence of n events may occur is

N1 * N2 * N3 * … * Nn

If the number of possible outcomes in an experiment is small, it is relatively easy to list and
count all possible outcomes. When there is a large number of possible outcomes, an enumeration
of cases is often difficult, tedious, or both. Therefore, to overcome such problems one can use
various counting techniques or rules.


In order to determine the number of outcomes, one can use several rules of counting.

- The addition rule


- The multiplication rule
- Permutation rule
- Combination rule

To list the outcomes of the sequence of events, a useful device called tree diagram is used.

1. ADDITION RULE

Suppose that a procedure, designated by 1, can be performed in n1 ways. Assume that a second
procedure, designated by 2, can be performed in n2 ways. Suppose furthermore that it is not
possible for both procedures 1 and 2 to be performed together. Then the number of ways in
which we can perform procedure 1 or procedure 2 is n1 + n2. This can be generalized as
follows: if there are k procedures and the i-th procedure may be performed in ni ways,
i = 1, 2, …, k, then the number of ways in which we can perform procedure 1 or 2 or … or k
is given by n1 + n2 + … + nk (the sum of ni over i = 1, …, k), assuming that no two
procedures are performed together.

Example1: - Suppose that we are planning a trip and are deciding between bus and train
transportation. If there are 3 bus routes and 2 train routes to go from A to B, find the available
routes for the trip. There are 3+2 = 5 possible routes for someone to go from A to B.

Example 2: A student goes to the nearest snack bar to have breakfast. He can take tea, coffee,
or milk with bread, cake, or a sandwich. How many possibilities does he have?

Solution:

Tea    with bread
       with cake
       with sandwich
Coffee with bread
       with cake
       with sandwich
Milk   with bread
       with cake
       with sandwich

Therefore, there are nine possibilities.

The Multiplication Rule:

If a choice consists of k steps, of which the first can be made in n1 ways, the second can be
made in n2 ways, …, and the k-th can be made in nk ways, then the whole choice can be made in

(n1 * n2 * … * nk) ways.
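
The breakfast example listed earlier is an instance of this rule with two steps (choose a drink, then choose a food). As a hedged illustration in Python, itertools.product enumerates every combination of one choice per step; the list names below are assumptions made for the sketch.

```python
from itertools import product

drinks = ["tea", "coffee", "milk"]        # step 1: 3 ways
foods = ["bread", "cake", "sandwich"]     # step 2: 3 ways

# Every (drink, food) pair corresponds to one branch of the tree diagram.
breakfasts = list(product(drinks, foods))

for drink, food in breakfasts:
    print(f"{drink} with {food}")
print(len(breakfasts), len(drinks) * len(foods))
```

Listing the pairs is only practical for small problems; the multiplication rule gives the count without enumeration.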

Example 1: The digits 0, 1, 2, 3, and 4 are to be used in a 4-digit identification card. How many
different cards are possible if

a) Repetitions are permitted.
b) Repetitions are not permitted.

Solutions

a)
1st digit    2nd digit    3rd digit    4th digit
    5            5            5            5

There are four steps:

1. Selecting the 1st digit, which can be done in 5 ways.
2. Selecting the 2nd digit, which can be done in 5 ways.
3. Selecting the 3rd digit, which can be done in 5 ways.
4. Selecting the 4th digit, which can be done in 5 ways.

Therefore, 5 * 5 * 5 * 5 = 625 different cards are possible.

b)
1st digit    2nd digit    3rd digit    4th digit
    5            4            3            2

There are four steps:

1. Selecting the 1st digit, which can be done in 5 ways.
2. Selecting the 2nd digit, which can be done in 4 ways.
3. Selecting the 3rd digit, which can be done in 3 ways.
4. Selecting the 4th digit, which can be done in 2 ways.

Therefore, 5 * 4 * 3 * 2 = 120 different cards are possible.

Example 2: An airline has 6 flights from A to B and 7 flights from B to C per day. If the
flights are to be made on separate days, in how many different ways can the airline offer a
trip from A to C?

Solution: In operation 1 there are 6 flights available from A to B, and in operation 2 there are
7 flights available from B to C. Altogether there are 6*7 = 42 possible ways to fly from A to C.

Example 3: Suppose that in a medical study patients are classified according to their blood
type as A, B, AB, or O; according to their Rh factor as + or -; and according to their
blood pressure as high, normal or low. In how many different ways can a patient be
classified?

Solution

The 1st classification can be done in 4 ways, the 2nd in 2 ways, and the 3rd in 3 ways. Thus, a
patient can be classified in 4*2*3 = 24 different ways.
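
The counts in Examples 1 and 3 can be confirmed by enumeration. The sketch below is an illustrative Python check, with variable names assumed for this example: itertools.product models steps where repetition is allowed, and itertools.permutations models steps where a digit cannot be reused.

```python
from itertools import permutations, product

digits = "01234"

# Example 1 a): each of the 4 positions may use any of the 5 digits.
cards_with_repetition = list(product(digits, repeat=4))

# Example 1 b): a digit cannot be reused, so 5, 4, 3, then 2 choices.
cards_without_repetition = list(permutations(digits, 4))

# Example 3: one choice from each of three independent classifications.
blood_types = ["A", "B", "AB", "O"]
rh_factors = ["+", "-"]
pressures = ["high", "normal", "low"]
classifications = list(product(blood_types, rh_factors, pressures))

print(len(cards_with_repetition), len(cards_without_repetition), len(classifications))
```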

Example 4: Suppose that a bank has two branches, each branch has two departments, and
each department has four employees. Then there are (2)(2)(4) = 16 choices of employees, and
the probability that a particular one will be randomly selected is 1/[(2)(2)(4)] = 1/16. We may
view the choice as done sequentially: first a branch is randomly chosen, then a department
within the branch, and then the employee within the department.

Permutation

An arrangement of n objects in a specified order is called a permutation of the objects.

Permutation Rules:

1. The number of permutations of n distinct objects taken all together is n!,

where n! = n * (n - 1) * (n - 2) * … * 3 * 2 * 1

2. The arrangement of n objects in a specified order using r objects at a time is called the
permutation of n objects taken r objects at a time. It is written as nPr and the formula is

nPr = n! / (n - r)!

3. The number of permutations of n objects in which k1 are alike, k2 are alike, …, km are alike
(with k1 + k2 + … + km = n) is

n! / (k1! * k2! * … * km!)

Example1:

1. Suppose we have the letters A, B, C, D.
a) How many permutations are there taking all four?
b) How many permutations are there if two letters are used at a time?
2. How many different permutations can be made from the letters in the word
“CORRECTION”?


Solutions:

1. a) Here n = 4; there are four distinct objects.

Therefore, there are 4! = 24 permutations.

b) Here n = 4, r = 2.

Therefore, there are 4P2 = 4! / (4 - 2)! = 24 / 2 = 12 permutations.

2. Here n = 10, of which 2 are C, 2 are O, 2 are R, 1 is E, 1 is T, 1 is I and 1 is N.

So k1 = 2, k2 = 2, k3 = 2, k4 = k5 = k6 = k7 = 1.

Using the 3rd rule of permutation, there are

10! / (2! * 2! * 2! * 1! * 1! * 1! * 1!) = 453,600 permutations.
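
The three permutation rules can be evaluated with the standard library of Python (math.factorial and math.perm, the latter available from Python 3.8). The helper multiset_permutations below is a hypothetical name introduced for this sketch; it applies the third rule by dividing n! by the factorial of each letter's count.

```python
from collections import Counter
from math import factorial, perm

def multiset_permutations(word):
    """Number of distinct arrangements of the letters of `word` (rule 3)."""
    result = factorial(len(word))
    for count in Counter(word).values():
        result //= factorial(count)   # divide out arrangements of alike letters
    return result

print(factorial(4))                          # rule 1: all four letters A, B, C, D
print(perm(4, 2))                            # rule 2: two letters at a time
print(multiset_permutations("CORRECTION"))   # rule 3: repeated letters
```

Counter("CORRECTION") finds the repeated letters automatically, so the same function can be reused for other words.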

Example 2: Jimma University Registrar Office wants to give identity numbers to students
using 4 digits. Only the following numbers may be used: {0, 1, 2, 3, 4, 5, 6}. How many
different ID numbers could be given by the Registrar

a. Without repeating a number?
b. With repetition of numbers?

Solution

We have 7 possible numbers, and the required number of digits for an ID number is 4.
Hence n = 7 and r = 4.

a. The possible number of ID numbers given to students without repeating a number is

nPr = n! / (n - r)! = 7! / (7 - 4)! = 7*6*5*4 = 840.

b. The possible number of ID numbers given to students with repetition of numbers is

n^r = 7^4 = 7*7*7*7 = 2401.

Exercises:

Six different statistics books, seven different physics books, and 3 different Economics books
are arranged on a shelf. How many different arrangements are possible if;

i. The books in each particular subject must all stand together


ii. Only the statistics books must stand together

Combination

A selection of objects without regard to order is called combination.

Example: Given the letters A, B, C, and D, list the permutations and combinations for
selecting two letters.

Solutions:

Permutations             Combinations

AB  BA  CA  DA           AB  BC
AC  BC  CB  DB           AC  BD
AD  BD  CD  DC           AD  DC

Note that in permutation AB is different from BA, but in combination AB is the same as BA.
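
The two lists can also be generated programmatically. This is a minimal Python sketch using itertools (the variable names are assumptions for illustration): permutations treats AB and BA as different, while combinations returns each unordered pair once.

```python
from itertools import combinations, permutations

letters = "ABCD"

# Ordered selections of two letters: AB and BA both appear.
perms = ["".join(p) for p in permutations(letters, 2)]

# Unordered selections of two letters: only one of AB / BA appears.
combs = ["".join(c) for c in combinations(letters, 2)]

print(len(perms), perms)
print(len(combs), combs)
```

Each combination of two letters corresponds to 2! = 2 permutations, which is why the first list is twice as long as the second.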

Combination Rule

The number of combinations of r objects selected from n objects is denoted by nCr or
C(n, r) (also written as n over r in parentheses) and is given by the formula:

C(n, r) = n! / ((n - r)! * r!)

Example 1:

1. In how many ways can a committee of 5 people be chosen out of 9 people?

Solution:

n = 9, r = 5

C(9, 5) = n! / ((n - r)! * r!) = 9! / (4! * 5!) = 126 ways

2. Among 15 clocks there are two defectives. In how many ways can an inspector choose three
of the clocks for inspection so that:
a) There is no restriction.
b) None of the defective clocks is included.
c) Only one of the defective clocks is included.
d) Two of the defective clocks are included.

Solutions: n = 15, of which 2 are defective and 13 are non-defective; and r = 3.

a) If there is no restriction, select three clocks from 15 clocks. This can be done in:

C(15, 3) = n! / ((n - r)! * r!) = 15! / (12! * 3!) = 455 ways

b) None of the defective clocks is included.

This is equivalent to zero defective and three non-defective, which can be done in:

C(2, 0) * C(13, 3) = 1 * 286 = 286 ways.

c) Only one of the defective clocks is included.

This is equivalent to one defective and two non-defective, which can be done in:

C(2, 1) * C(13, 2) = 2 * 78 = 156 ways.

d) Two of the defective clocks are included.

This is equivalent to two defective and one non-defective, which can be done in:

C(2, 2) * C(13, 1) = 1 * 13 = 13 ways.

Example 2: The number of combinations of the letters a, b and c taken two at a time is
C(3, 2) = 3! / (2! * 1!) = 3.
These are ab, ac and bc. Note that ab is the same combination as ba, but not the same
permutation.
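
The clock-inspection counts can be reproduced with math.comb (Python 3.8 or later). The sketch below is illustrative; the function name ways and the constants are assumptions introduced here. It also checks that the three restricted cases add up to the unrestricted total.

```python
from math import comb

DEFECTIVE, GOOD, PICK = 2, 13, 3

def ways(defective_chosen):
    """Ways to pick PICK clocks with exactly `defective_chosen` defectives."""
    return comb(DEFECTIVE, defective_chosen) * comb(GOOD, PICK - defective_chosen)

total = comb(DEFECTIVE + GOOD, PICK)      # a) no restriction
by_case = [ways(k) for k in range(3)]     # b), c), d): 0, 1 or 2 defectives

print(total, by_case, sum(by_case))
```

Because every selection contains 0, 1 or 2 defective clocks, the cases are mutually exclusive and exhaustive, so their counts must sum to the total.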

Example 3: Suppose a box contains 3 red, 3 white and 5 black balls of equal size. We
want to draw 3 balls at a time. In how many ways can we draw one ball of each colour?

Solution: C(3, 1) * C(3, 1) * C(5, 1) = 3 * 3 * 5 = 45 ways.

Exercises:

1. Out of 5 Mathematicians and 7 Statisticians, a committee consisting of 2 Mathematicians
and 3 Statisticians is to be formed. In how many ways can this be done if
a) There is no restriction
b) One particular Statistician should be included
c) Two particular Mathematicians cannot be included on the committee.
2. If 3 books are picked at random from a shelf containing 5 novels, 3 books of poems, and a
dictionary, in how many ways can this be done if
a) There is no restriction.
b) The dictionary is selected?
c) 2 novels and 1 book of poems are selected?

4.5. Approaches to measuring Probability

There are four different conceptual approaches to the study of probability theory. These are:

- The classical approach.
- The frequentist approach.
- The axiomatic approach.
- The subjective approach.

The classical approach

This approach is used when:

- All outcomes are equally likely.
- The total number of outcomes is finite, say N.

Definition: If a random experiment with N equally likely outcomes is conducted and out of
these NA outcomes are favourable to the event A, then the probability that event A occurs,
denoted P(A), is defined as:

P(A) = NA / N = (No. of outcomes favourable to A) / (Total number of outcomes) = n(A) / n(S)

Examples:

1. A fair die is tossed once. What is the probability of getting
a) Number 4?
b) An odd number?
c) An even number?
d) Number 8?

Solutions:

First identify the sample space, say S:

S = {1, 2, 3, 4, 5, 6}
N = n(S) = 6

a) Let A be the event of number 4.
A = {4}, NA = n(A) = 1
P(A) = n(A) / n(S) = 1/6

b) Let A be the event of odd numbers.
A = {1, 3, 5}, NA = n(A) = 3
P(A) = n(A) / n(S) = 3/6 = 0.5

c) Let A be the event of even numbers.
A = {2, 4, 6}, NA = n(A) = 3
P(A) = n(A) / n(S) = 3/6 = 0.5

d) Let A be the event of number 8.
A = { }, NA = n(A) = 0
P(A) = n(A) / n(S) = 0/6 = 0
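
The classical definition P(A) = n(A)/n(S) translates into a one-line function. The following Python sketch is an illustration, not the module's own method; the function name classical_probability is an assumption. It uses exact fractions so that 1/6 is not rounded.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(A) = n(A) / n(S) for equally likely outcomes."""
    favourable = set(event) & set(sample_space)   # outcomes outside S cannot occur
    return Fraction(len(favourable), len(sample_space))

S = {1, 2, 3, 4, 5, 6}

p_four = classical_probability({4}, S)
p_odd = classical_probability({1, 3, 5}, S)
p_even = classical_probability({2, 4, 6}, S)
p_eight = classical_probability({8}, S)   # impossible event

print(p_four, p_odd, p_even, p_eight)
```

The function is only valid under the two conditions stated for the classical approach: a finite sample space and equally likely outcomes.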

2. A box of 80 candles consists of 30 defective and 50 non-defective candles. If 10 of these
candles are selected at random, what is the probability that
a) All will be defective.
b) 6 will be non-defective.
c) All will be non-defective.

Solutions:

Total number of selections: N = n(S) = C(80, 10), the number of ways of choosing 10 candles
out of 80.

a) Let A be the event that all will be defective.

Total ways in which A occurs: NA = n(A) = C(30, 10) * C(50, 0)

P(A) = n(A) / n(S) = C(30, 10) * C(50, 0) / C(80, 10) ≈ 0.00001825

b) Let A be the event that 6 will be non-defective.

Total ways in which A occurs: NA = n(A) = C(30, 4) * C(50, 6)

P(A) = n(A) / n(S) = C(30, 4) * C(50, 6) / C(80, 10) ≈ 0.2645

c) Let A be the event that all will be non-defective.

Total ways in which A occurs: NA = n(A) = C(30, 0) * C(50, 10)

P(A) = n(A) / n(S) = C(30, 0) * C(50, 10) / C(80, 10) ≈ 0.00624

3. In a given basket there are 3 yellow, 4 black and 3 white balls. What is the probability of
selecting one black ball?

Solution: Let A be the event of drawing a black ball.

P(A) = (favourable cases to A) / (exhaustive number of cases) = 4/10 = 0.4


Exercises:

1. What is the probability that a waitress will refuse to serve alcoholic beverages to only three
minors if she randomly checks the I.D.s of five students from among ten students of which
four are not of legal age?

Shortcomings of the classical approach:

This approach is not applicable when:

- The total number of outcomes is infinite.
- Outcomes are not equally likely.

The Frequentist Approach

The classical definition of probability has a disadvantage in that the phrase “equally likely” is
vague. In fact, since these words seem to be synonymous with “equally probable”, the
definition is circular because we are essentially defining probability in terms of itself.

For this reason, a statistical definition of probability has been advocated by some people.
According to this, the estimated probability, or empirical probability, of an event is taken to be
the relative frequency of occurrence of the event when the number of observations is very
large. The probability itself is the limit of the relative frequency as the number of observations
increases indefinitely.

Definition: The probability of an event A is the proportion of outcomes favourable to A in the
long run when the experiment is repeated under the same conditions:

P(A) = lim (N → ∞) NA / N

Example 1: Records show that 60 out of 100,000 bulbs produced are defective. What is the
probability that a newly produced bulb is defective?

Solution: Let A be the event that the newly produced bulb is defective. Taking the observed
relative frequency as the estimate of the limit,

P(A) ≈ NA / N = 60 / 100,000 = 0.0006

Example 2: If 1000 tosses of a coin result in 529 heads, the relative frequency of heads is
529/1000 = 0.529. If another 1000 tosses result in 493 heads, the relative frequency in the
total of 2000 tosses is (529 + 493)/2000 = 0.511.

According to the statistical definition, by counting in this manner we should ultimately get
closer and closer to a number that represents the probability of a head in a single toss of the
coin. From the results so far presented, this should be 0.5 to one significant figure.

This approach is based on the relative frequencies of outcomes belonging to an event.
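
The long-run behaviour described above can be imitated with a simulated fair coin. The following Python sketch is only an illustration under stated assumptions: a pseudo-random generator stands in for a real coin, and the function name and seed value are arbitrary choices made here.

```python
import random

def relative_frequency_of_heads(tosses, seed=0):
    """Simulate `tosses` fair-coin tosses and return the share of heads."""
    rng = random.Random(seed)              # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return heads / tosses

for n in (100, 1_000, 10_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

The printed frequencies fluctuate for small n and tend to settle near 0.5 as n grows, which is the behaviour the frequentist definition relies on. A simulation cannot reach the limit itself; it only suggests it.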

Axiomatic Approach:

Let E be a random experiment and S be a sample space associated with E. With each event A
we associate a real number P(A), called the probability of A, which satisfies the following
properties, called axioms of probability or postulates of probability.

1. P(A) ≥ 0
2. P(S) = 1, S is the sure event.
3. If A and B are mutually exclusive events, the probability that one or the other occurs equals
the sum of the two probabilities, i.e. P(A ∪ B) = P(A) + P(B)
4. If A and B are independent events, the probability that both will occur is the product of the
two probabilities, i.e. P(A ∩ B) = P(A)*P(B)
5. P(A') = 1 - P(A)
6. 0 ≤ P(A) ≤ 1
7. P(ø) = 0, ø is the impossible event.


Remark: Venn diagrams can be used to solve probability problems.

[Figure: Venn diagrams of A ∪ B and A ∩ B]

In general, P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
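
The general addition rule can be checked on a die roll with two overlapping events. This Python sketch is illustrative; the events A and B and the helper P are assumptions chosen for the demonstration.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}      # even number
B = {4, 5, 6}      # number greater than 3

def P(event):
    """Classical probability on the equally likely sample space S."""
    return Fraction(len(event & S), len(S))

lhs = P(A | B)                    # P(A ∪ B) counted directly
rhs = P(A) + P(B) - P(A & B)      # addition rule

print(lhs, rhs)
```

Subtracting P(A ∩ B) corrects for the outcomes 4 and 6, which would otherwise be counted twice. When A and B are mutually exclusive the correction term is zero and the rule reduces to axiom 3.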

4.6. Conditional probability and Independency

Conditional Events: If the occurrence of one event affects the probability of occurrence of
another event, then the two events are conditional or dependent events.

Example: Suppose we have two red and three white balls in a bag.

1. Draw a ball with replacement

Since the first drawn ball is replaced before the second draw, it does not affect the second
draw. For this reason A and B are independent. If we let

A = the event that the first draw is red, then P(A) = 2/5

B = the event that the second draw is red, then P(B) = 2/5

2. Draw a ball without replacement

This is conditional because the first drawn ball is not replaced before the second draw, so
it does affect the second draw. If we let

A = the event that the first draw is red, then P(A) = 2/5

B = the event that the second draw is red, then P(B) depends on the outcome of the first draw.

Given that the first draw is red, one red ball remains among the four balls left, so
P(B | A) = 1/4.

Conditional probability of an event

The conditional probability of an event A given that B has already occurred, denoted by
P(A | B), is

P(A | B) = P(A ∩ B) / P(B), P(B) ≠ 0

Remark: (1) P(A' | B) = 1 - P(A | B)

(2) P(B' | A) = 1 - P(B | A)

Examples

1. For a student enrolling as a freshman at a certain university, the probability is 0.25 that
he/she will get a scholarship and 0.75 that he/she will graduate. The probability is 0.2 that
he/she will get a scholarship and will also graduate. What is the probability that a student
who gets a scholarship will graduate?

Solution: Let A = the event that a student will get a scholarship

B = the event that a student will graduate

Given: P(A) = 0.25, P(B) = 0.75, P(A ∩ B) = 0.20

Required: P(B | A)

P(B | A) = P(A ∩ B) / P(A) = 0.20 / 0.25 = 0.80

160
Basic Statistics

2. If the probability that a research project will be well planned is 0.60 and the probability
that it will be well planned and well executed is 0.54, what is the probability that it will be
well executed given that it is well planned?

Solution: Let A = the event that a research project will be well planned

B = the event that a research project will be well executed

Given: P(A) = 0.60, P(A ∩ B) = 0.54

Required: P(B | A)

P(B | A) = P(A ∩ B) / P(A) = 0.54 / 0.60 = 0.90
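
Both worked examples use the same formula, P(B | A) = P(A ∩ B) / P(A). The Python sketch below is a hedged illustration; the function name conditional is an assumption, and the probabilities are passed as decimal strings so that the fractions module keeps them exact.

```python
from fractions import Fraction

def conditional(p_joint, p_given):
    """P(B | A) = P(A and B) / P(A); undefined when P(A) = 0."""
    p_joint, p_given = Fraction(p_joint), Fraction(p_given)
    if p_given == 0:
        raise ValueError("conditioning event has probability zero")
    return p_joint / p_given

# Example 1: scholarship (A) and graduation (B)
p_graduate_given_scholarship = conditional("0.20", "0.25")

# Example 2: well planned (A) and well executed (B)
p_executed_given_planned = conditional("0.54", "0.60")

print(float(p_graduate_given_scholarship), float(p_executed_given_planned))
```

The guard against a zero denominator mirrors the condition stated with the definition of conditional probability.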

Exercise: A lot consists of 20 defective and 80 non-defective items from which two items are
chosen without replacement. Events A & B are defined as A = the first item chosen is
defective, B = the second item chosen is defective

a) What is the probability that both items are defective?


b) What is the probability that the second item is defective?

Note: for any two events A and B the following relation holds.

P(B) = P(B | A) * P(A) + P(B | A') * P(A')

Probability of Independent Events

Two events A and B are independent if and only if P(A ∩ B) = P(A) * P(B)

Here P(A | B) = P(A), P(B | A) = P(B)


Example: A box contains four black and six white balls. What is the probability of getting
two black balls in drawing one after the other under the following conditions?

a. The first ball drawn is not replaced
b. The first ball drawn is replaced

Solution: Let A = the first ball drawn is black

B = the second ball drawn is black

Required: P(A ∩ B)

a. P(A ∩ B) = P(B | A) * P(A) = (3/9)(4/10) = 2/15

b. P(A ∩ B) = P(A) * P(B) = (4/10)(4/10) = 4/25
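
Both answers can be confirmed by listing every ordered pair of draws. In the Python sketch below (an illustration; the names balls and p_two_black are assumptions), itertools.permutations models drawing without replacement and itertools.product models drawing with replacement. The ten balls are treated as distinct objects, so all ordered pairs are equally likely.

```python
from fractions import Fraction
from itertools import permutations, product

balls = ["black"] * 4 + ["white"] * 6

def p_two_black(pairs):
    """Share of equally likely (first, second) draws that are both black."""
    pairs = list(pairs)
    hits = sum(1 for first, second in pairs if first == second == "black")
    return Fraction(hits, len(pairs))

# a) without replacement: the same ball cannot be drawn twice
p_without_replacement = p_two_black(permutations(balls, 2))

# b) with replacement: the first ball may be drawn again
p_with_replacement = p_two_black(product(balls, repeat=2))

print(p_without_replacement, p_with_replacement)
```

Only in case b) does the result equal P(A) * P(B), which is the defining property of independent events.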

4.7. Theorem of total probability and Bayes’ Theorem

As we have already noted in the introduction, the basic objective behind calculating
probabilities is to help us in making decisions by quantifying the uncertainties involved in the
situations. Quite often, whether it is in our personal life or our work life, decision-making is
an ongoing process. Consider, for example, a seller of winter garments who is interested in
the demand for the product. In deciding on the amount he should stock for this winter, he has
computed the probability of selling different quantities and has noted that the chance of
selling a large quantity is very high. Accordingly, he has taken the decision to stock a large
quantity of the product. Suppose that when the winter finally comes and the season ends, he
discovers that he is left with a large quantity of stock. Assuming that he stays in this business,
he feels that the earlier probability should be revised in the light of this experience to help
him decide on the stock for the next winter. Similar to the situation of the seller of winter
garments, situations exist where we are interested in an event on an ongoing basis. Every time
some new information is available, we revise our odds mentally. This revision of probability
with added information is formalised in probability theory with the help of the famous
Bayes' Theorem.

The theorem, discovered in 1761 by the English clergyman Thomas Bayes, has had a
profound impact on the development of statistics and is responsible for the emergence of a
new philosophy of science. Bayes himself is said to have been unsure of his extraordinary
result, which was presented to the Royal Society by a friend in 1763 - after Bayes' death. We
will first understand The Law of Total Probability, which is helpful for derivation of Bayes'
Theorem.

Partition of sample space: A collection of events {B1, B2, . . . , Bn} of a sample space S is
called a partition of S if B1, B2, . . . , Bn are mutually exclusive and B1 ∪ B2 ∪ · · · ∪ Bn = S.

Theorem of total probability: If the events B1, B2, . . . , Bn constitute a partition of the
sample space S such that P(Bi) ≠ 0 for i = 1, 2, . . . , n, then for any event A of S,

P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3) + . . . + P(A ∩ Bn)
     = P(B1)P(A|B1) + P(B2)P(A|B2) + . . . + P(Bn)P(A|Bn)

Example: In a certain assembly plant, three machines, B1, B2, and B3, make 30%, 45%, and
25%, respectively, of the products. It is known from past experience that 2%, 3%, and 2% of
the products made by each machine, respectively, are defective. Now, suppose that a finished
product is randomly selected. What is the probability that it is defective?

Solution: Consider the following events:

A: the product is defective,
B1: the product is made by machine B1,
B2: the product is made by machine B2,
B3: the product is made by machine B3.

Then P(B1) = 0.3, P(B2) = 0.45, P(B3) = 0.25, P(A|B1) = 0.02, P(A|B2) = 0.03, P(A|B3) = 0.02.
Applying the theorem of total probability,

P(A) = P(B1)P(A|B1) + P(B2)P(A|B2) + P(B3)P(A|B3)

= (0.3)(0.02) + (0.45)(0.03) + (0.25)(0.02)

= 0.006 + 0.0135 + 0.005 = 0.0245
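The same calculation can be checked with a short Python sketch. The names `priors`, `defect_rates` and `total_probability` are illustrative choices made here and are not part of the module.

```python
# Law of total probability for the three-machine example.
# P(Bi): share of production made by each machine
priors = {"B1": 0.30, "B2": 0.45, "B3": 0.25}
# P(A|Bi): probability that a product is defective given the machine
defect_rates = {"B1": 0.02, "B2": 0.03, "B3": 0.02}


def total_probability(priors, likelihoods):
    """Return P(A) = sum over i of P(Bi) * P(A|Bi)."""
    return sum(priors[b] * likelihoods[b] for b in priors)


p_defective = total_probability(priors, defect_rates)
print(round(p_defective, 4))  # 0.0245
```

The function only makes sense when the events Bi form a partition, so the priors should add up to 1.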


SELF-ASSESSMENT QUESTIONS

1. Explain what you understand by the term ‘probability’. How is the concept of probability
relevant to decision making under uncertainty?

2. What are the different approaches to the definition of probability? Are these approaches
contradictory to one another? Which of these approaches would you apply for calculating the
probability that:

(a) A leap year, selected at random, will contain 53 Mondays.

(b) An item, selected at random from a production process, is defective.

(c) Mr. Bhupinder S. Hooda will win the assembly election from Kiloi.

3. With the help of an example explain the meaning of the following:

(a) Random experiment, and sample space

(b) An event as a subset of sample space

(c) Equally likely events

(d) Mutually exclusive events.

(e) Exhaustive events

(f) Elementary and compound events.

4. State and develop the Addition Theorem of probability for:

(a) Mutually exclusive events

(b) Overlapping events


(c) Complementary events

5. Explain the concept of conditional probability with the help of a suitable example.

6. State and develop the Multiplication Theorem of probability for:

(a) Dependent events

(b) Independent events

7. State Bayes’ Theorem of probability. Using an appropriate example, develop the
Bayesian probability rule and generalize it.

8. What do you understand by permutations and combinations?

(a) In how many ways can we select three players out of the 12 players of the Indian Cricket
team for playing in the World XI team?

(b) In how many ways can a sub-committee of 2 out of 6 members of the executive committee
of the employees’ association be constituted?

9. What is the probability that a non-leap year, selected at random, will contain

(a) 52 Sundays? (b) 53 Sundays? (c) 54 Sundays?

10. A card is drawn at random from a well-shuffled deck of 52 cards. Find the probability
that

(a) the card is either a club or diamond

(b) the card is not a king

(c) the card is either a face card or a club card.

11. From a well-shuffled deck of 52 cards, two cards are drawn at random.


(a) If the cards are drawn simultaneously, find the probability that they consist of

(i) Both clubs,

(ii) A king and a queen,

(iii) A face card and an 8.

(b) If the cards are drawn one after the other with replacement, find the probability that they
consist of

(i) Both clubs,

(ii) A king and a queen,

(iii) A face card and an 8.

12. A problem in mathematics is given to four students A, B, C, and D. Their chances of
solving it are 1/2, 1/3, 1/4 and 1/5 respectively. Find the probability that the problem
will

(a) Be solved

(b) Not be solved

13. The odds that A speaks the truth are 3:2 and the odds that B does so are 7:3. In what
percentage of cases are they likely to

(a) Contradict each other on an identical point?

(b) Agree with each other on an identical point?


14. Among the sales staff engaged by a company 60% are males. In terms of their
professional qualifications, 70% of males and 50% of females have a degree in marketing.
Find the probability that a sales person selected at random will be

(a) A female with degree in marketing

(b) A male without degree in marketing

15. A factory has three units A, B, and C. Unit A produces 50% of the factory's products, and
units B and C each produce 25% of the products. The percentages of defective items produced by
units A, B, and C are 3%, 2% and 1%, respectively. If an item selected at random from the
total production of the factory is found to be defective, what is the probability that it was
produced by:

(a) Unit A

(b) Unit B

(c) Unit C


4.8. PROBABILITY DISTRIBUTION

Introduction

In many situations, our interest does not lie in the outcomes of an experiment as such; we may
find it more useful to describe a particular property or attribute of the outcomes of an
experiment in numerical terms. For example, for three births our interest may lie in the
probabilities of the number of boys. Consider the sample space of 8 equally
likely sample points.

GGG GGB GBG BGG

GBB BGB BBG BBB

Now look at the variable “the number of boys out of three births”. This number varies among
sample points in the sample space and can take the values 0, 1, 2, 3; it is random, that is,
determined by chance.

“A random variable is an uncertain quantity whose value depends on chance.”

A random variable may be…

Discrete if it takes only a countable number of values. For example, the number of dots on two
dice, the number of heads in three coin tosses, the number of defective items, the number of
boys in three births and so on.

Continuous if it can take on any value in an interval of numbers (i.e. its possible values are
uncountably infinite). For example, measured data on heights, weights, temperature, time
and so on.

A random variable has a probability law, a rule that assigns probabilities to the different
values of the random variable. This probability law (the probability assignment) is called the
probability distribution of the random variable. We usually denote the random variable by X.


In this lesson, we will discuss discrete probability distributions and Continuous probability
distributions.

4.8.1. Definition of random variables and probability distributions

Concept of random variable (r.v.):

Variable: any characteristic or attribute that can assume different values.

A random variable (r.v.): a variable whose values are determined by chance.

- It is a function which associates a real number with each possible outcome of an
experiment. It is often the case that our primary interest is in the numerical value of the
random variable rather than the outcome itself. The following examples will help us make this
idea clear.

E.g.1: Suppose a coin is tossed three times. Let X be the number of heads.

Solution: If we toss a coin three times, then the experiment has a total of eight possible
outcomes, and they are as follows: S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}.
Since X denotes the number of heads out of the three tosses, a value of the r.v. X is
associated with each outcome of this experiment. Therefore, X is a function defined on
the elements of S, and the possible values of X are {0, 1, 2, 3}.

Specifically, X(HHH) = 3, X(HHT) = X(HTH) = X(THH) = 2, X(HTT) = X(THT) = X(TTH) = 1,
X(TTT) = 0.
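The idea that a random variable is a function on the sample space can be made concrete with a small Python sketch. The function name `X` mirrors the notation of the example; the other names are illustrative.

```python
from itertools import product

# Sample space of three coin tosses: 'HHH', 'HHT', ..., 'TTT'
sample_space = ["".join(tosses) for tosses in product("HT", repeat=3)]


def X(outcome):
    """Random variable: the number of heads in one outcome."""
    return outcome.count("H")


# Value of X attached to every outcome of the experiment
values = {outcome: X(outcome) for outcome in sample_space}

print(values["HHH"], values["HHT"], values["HTT"], values["TTT"])  # 3 2 1 0
print(sorted(set(values.values())))  # [0, 1, 2, 3]
```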

Discrete random variable: Let X be a r.v. If the number of possible values of X is finite or
countably infinite, we call X a discrete r.v.

- The possible values of X can be listed as x1, x2, x3, …, xn, …

- Let X be a discrete r.v. With each possible value xi we associate a number P(xi) = P(X = xi),
called the probability of xi. The numbers P(xi) must satisfy the following requirements for a
probability distribution:


1. The sum of the probabilities of all the events in the sample space must be equal to 1, i.e.
$\sum_{i} P(x_i) = 1$.

2. The probability of each event in the sample space must be between zero and one inclusive,
i.e. 0 ≤ P(xi) ≤ 1.

- The collection of pairs (xi, P(xi)) is called a discrete probability distribution.

The random variable X denoting “the number of boys out of three births” is a discrete random
variable, so it will have a discrete probability distribution. It is easy to visualize that the
random variable X is a function of the sample space. We can see the correspondence of sample
points with the values of the random variable as follows:

GGG              (X = 0)
GGB, GBG, BGG    (X = 1)
GBB, BGB, BBG    (X = 2)
BBB              (X = 3)

The correspondence between sample points and the value of the random variable allows us to
determine the probability distribution of X as follows:

P(X=0) = 1/8 since one out of 8 equally likely points leads to X = 0

P(X=1) = 3/8 since three out of 8 equally likely points lead to X = 1

P(X=2) = 3/8 since three out of 8 equally likely points lead to X = 2

P(X=3) = 1/8 since one out of 8 equally likely points leads to X = 3

The above probability statements constitute the probability distribution of the random variable
X = number of boys in three births. We may appreciate how this probability law is obtained
simply by associating values of X with sets in the sample space. (For example, the set {GGB,
GBG, BGG} leads to X = 1.) We may write down the probability distribution of X in table
format, or we may plot it graphically by means of a probability histogram or a line chart.


Probability Distribution of the Number of Boys out of Three Births

No. of Boys (X)    Probability P(X)
0                  1/8
1                  3/8
2                  3/8
3                  1/8

Figure: Probability Distribution of the Number of Boys out of Three Births
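The distribution of the number of boys can also be obtained by counting sample points. The sketch below assumes, as the text does, that the 8 sample points are equally likely; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 8 equally likely sequences of three births (B = boy, G = girl)
births = ["".join(seq) for seq in product("BG", repeat=3)]

# Count how many sample points lead to each value of X = number of boys
counts = Counter(outcome.count("B") for outcome in births)

# P(X = x) = (number of favourable points) / (total number of points)
pmf = {x: Fraction(counts[x], len(births)) for x in sorted(counts)}

for x, p in pmf.items():
    print(x, p)
# 0 1/8
# 1 3/8
# 2 3/8
# 3 1/8
```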

Exercise: A coin is tossed twice. Let X be the number of heads. Construct the probability
distribution of X.

Cumulative Distribution Function

The probability distribution of a discrete random variable lists the probabilities of occurrence
of the different values of the random variable. We may also be interested in cumulative
probabilities of the random variable. That is, we may be interested in the probability that the
value of the random variable is at most some value x. This is the sum of the probabilities of
all the values i of X that are less than or equal to x.

The cumulative distribution function (also called cumulative probability function) F(x) of
a discrete random variable X is

\[ F(x) = P(X \le x) = \sum_{i \le x} P(X = i) \]

For example, to find the probability of at most two boys out of three births, we have

\[ F(2) = P(X \le 2) = \sum_{i \le 2} P(X = i) \]

= P(X = 0) + P(X = 1) + P(X = 2)

= 1/8 + 3/8 + 3/8

= 7/8
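A cumulative probability is just a running sum of the probability distribution. The following sketch computes it for the three-births example; the function name `cdf` is an illustrative choice.

```python
from fractions import Fraction

# Probability distribution of X = number of boys out of three births
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}


def cdf(x, pmf):
    """Return P(X <= x): the sum of P(X = i) over all values i <= x."""
    return sum(p for value, p in pmf.items() if value <= x)


print(cdf(2, pmf))  # 7/8
print(cdf(3, pmf))  # 1
```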

Continuous random variable: X is continuous if it assumes all values in some interval (c, d),
where c, d ∈ R, and there exists a function f, called the probability density function (pdf) of
X, satisfying the following conditions.

a. f(x) ≥ 0 for all x
b. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
c. For any a and b with −∞ < a < b < ∞, we have

\[ P(a < X < b) = \int_{a}^{b} f(x)\,dx \]

Remark: a. P(X = a) = 0

b. P(a < X < b) = P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b)
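To see the three conditions at work, the sketch below uses a hypothetical density f(x) = 2x on (0, 1), which is an assumption made for illustration and does not come from the module. The integrals are approximated with a simple midpoint rule; the helper `integrate` is likewise illustrative.

```python
def f(x):
    """Hypothetical density (an assumption): f(x) = 2x on (0, 1), 0 elsewhere."""
    return 2.0 * x if 0.0 < x < 1.0 else 0.0


def integrate(func, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return sum(func(a + (k + 0.5) * width) for k in range(steps)) * width


# Condition a: the density is never negative (checked on a grid of points)
non_negative = all(f(k / 100) >= 0 for k in range(-100, 201))

# Condition b: the total area under the density is 1
total_area = integrate(f, 0.0, 1.0)

# Condition c: P(0.2 < X < 0.5) is the area between 0.2 and 0.5
prob = integrate(f, 0.2, 0.5)

print(non_negative, round(total_area, 6), round(prob, 6))  # True 1.0 0.21
```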


Example1. Determine whether each distribution is a probability distribution.

a.
x       0    1    2    3
P(x)    ¼    ¼    ¼    ¼

b.
x       0    1    2    3
P(x)    -1   ½    ¼    ¼

c.
x       1    2    3    4
P(x)    ¼    ¼    ½    ¼
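The two requirements for a probability distribution (every probability between 0 and 1, and all probabilities adding up to 1) can be checked mechanically. The sketch below applies them to the three tables; the function name `is_probability_distribution` is an illustrative choice.

```python
from fractions import Fraction as F


def is_probability_distribution(probs):
    """Check 0 <= P(x) <= 1 for every value and that the P(x) sum to 1."""
    return all(0 <= p <= 1 for p in probs) and sum(probs) == 1


table_a = [F(1, 4), F(1, 4), F(1, 4), F(1, 4)]
table_b = [F(-1), F(1, 2), F(1, 4), F(1, 4)]    # contains a negative probability
table_c = [F(1, 4), F(1, 4), F(1, 2), F(1, 4)]  # probabilities add up to 5/4

print(is_probability_distribution(table_a))  # True
print(is_probability_distribution(table_b))  # False
print(is_probability_distribution(table_c))  # False
```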

Exercise: Construct a probability distribution for rolling a single die.

A probability distribution can be shown graphically by representing the values of x on the
x-axis and the probabilities p(x) on the y-axis.

Exercise: Construct a probability distribution for the number of girls in a family with two
children.

4.9. Mean, Variance and Expectation of R.v

Defn: Let X be a discrete r.v. with possible values x1, x2, x3, …, xn, … and probabilities
P(x1), P(x2), …, P(xn), … respectively. Then the expected value of X, or the mean value of X,
denoted by E(X) or μx, is defined as

\[ \mu_x = E(X) = \sum_{i=1}^{\infty} x_i P(x_i) \]

- If X assumes a finite number of values,

\[ \mu_x = E(X) = \sum_{i=1}^{n} x_i P(x_i) \]

- If all n outcomes are equally likely,

\[ E(X) = \frac{1}{n} \sum_{i=1}^{n} x_i \]


E.g.4: In a family with two children, find the mean of the number of children who will be
girls.

E.g.5: One thousand tickets are sold at $1 each for a color television valued at $350. What is
the expected value of the gain if a person purchases one ticket?
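Both examples are direct applications of E(X) = Σ xi P(xi). The sketch below works them out under two assumptions that the text leaves implicit: a boy and a girl are equally likely at each birth, and exactly one of the 1000 tickets wins, with the gain measured net of the $1 ticket price. The function name `expected_value` is illustrative.

```python
from fractions import Fraction as F


def expected_value(dist):
    """E(X) = sum of x * P(x) over a {value: probability} dictionary."""
    return sum(x * p for x, p in dist.items())


# E.g.4: number of girls in a family with two children (assumes P(girl) = 1/2)
girls = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}

# E.g.5: net gain from one $1 ticket (assumes exactly one winning ticket)
gain = {350 - 1: F(1, 1000), -1: F(999, 1000)}

print(expected_value(girls))        # 1
print(float(expected_value(gain)))  # -0.65
```

Under these assumptions the expected gain is negative: on average a buyer loses 65 cents per ticket.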

Defn: Let X be a r.v. The variance of X, denoted by Var(X) or $\sigma^2$, is defined as

\[ \mathrm{Var}(X) = \sigma^2 = E\{[X - E(X)]^2\} \]

or

\[ \mathrm{Var}(X) = \sigma^2 = E(X^2) - [E(X)]^2 \]

- The standard deviation is the square root of the variance:

\[ \sigma = \sqrt{\sigma^2} = \sqrt{\mathrm{Var}(X)} \]

Note: The variance and standard deviation cannot be negative.

E(aX + b) = a E(X) + b, V(aX + b) = a² V(X)

E.g.6: Find the mean and variance of the number of spots that appear when a die is rolled.

x       1     2     3     4     5     6
P(x)    1/6   1/6   1/6   1/6   1/6   1/6
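The mean and variance for the die can be computed directly from the definitions E(X) = Σ x P(x) and Var(X) = E(X²) − [E(X)]². The following is a minimal sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction as F

# Probability distribution of the number of spots on a fair die
die = {x: F(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in die.items())              # E(X)
second_moment = sum(x**2 * p for x, p in die.items())  # E(X^2)
variance = second_moment - mean**2                     # E(X^2) - [E(X)]^2

print(mean, variance)  # 7/2 35/12
```

That is, the mean is 3.5 and the variance is about 2.92.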

Exercise:- Calculate the mean and variance of the following distribution


x 2 3 4 5 6 7
p(x) 1/12 2/12 3/12 3/12 2/12 1/12


4.10. Common discrete probability distributions

Probability Distributions are Theoretical Distributions

Consider a random variable X that measures the “number of heads” in a three-trial coin tossing
experiment. The probability distribution of X will be

X         0     1     2     3
P(X=x)    1/8   3/8   3/8   1/8

Now imagine this experiment is repeated 200 times. We may expect that ‘no head’ and ‘three
heads’ will each occur 25 times, and that ‘one head’ and ‘two heads’ will each occur 75 times.
Since these results are what we expect on the basis of theory, the resultant distribution is
called a theoretical or expected distribution.

However, when the experiment is actually performed 200 times, the results, which we may
actually obtain, will normally differ from the theoretically expected results. It is quite possible
that in actual experiment ‘no head’ and ‘three heads’ may occur 20 and 28 times respectively
and ‘one head’ and ‘two heads’ may occur 66 and 86 times respectively. The distribution so
obtained through actual experiment is called the empirical or observed distribution.
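The difference between the theoretical and the empirical distribution can be illustrated by simulation. The sketch below repeats the three-toss experiment 200 times with Python's pseudo-random generator; the seed value is arbitrary, and the observed counts will change with a different seed.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed, used only to make the run repeatable

n_repeats = 200
theoretical = {0: 1 / 8, 1: 3 / 8, 2: 3 / 8, 3: 1 / 8}

# Expected (theoretical) frequencies: 200 * P(X = x)
expected = {x: n_repeats * p for x, p in theoretical.items()}

# Observed (empirical) frequencies: toss a fair coin three times, 200 times over
observed = Counter(
    sum(random.random() < 0.5 for _ in range(3)) for _ in range(n_repeats)
)

for x in range(4):
    print(x, expected[x], observed[x])
```

The expected column is always 25, 75, 75, 25, while the observed column typically differs from it by a few counts.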

In practice, however, assessing the probability of every possible value of a random variable
through actual experiment can be difficult, even impossible, especially when the probabilities
are very small. But we may be able to find out what type of random variable the one at hand is
by examining the causes that make it random. Knowing the type, we can often approximate
the random variable to a standard one for which convenient formulae are available.

The proper identification of experiments with certain known processes in Probability theory
can help us in writing down the probability distribution function. Two such processes are the
Bernoulli Process and the Poisson Process. The standard discrete probability distributions
that are consequent to these processes are the Binomial and the Poisson distribution. We will
now look into the conditions that characterize these processes, and examine the standard
distributions associated with the processes. This will enable us to identify situations for which
these distributions apply.


Let us first study the Bernoulli random variable, named so in honor of the mathematician
Jakob Bernoulli (1654-1705). It is the building block for other random variables and the
resulting distributions we will study in this lesson.

1. Bernoulli random variable

Suppose an operator uses a lathe to produce pins, and the lathe is not perfect in the sense that
it does not always produce a good pin. Rather, it has a probability p of producing a good pin
and (1 - p) of producing a defective one. Let us denote a good pin as “success” and a
defective pin as “failure”.

Just after the operator produces one pin, it is inspected; let X denote the “number of good pins
produced”, i.e. “the number of successes”. Analyzing the trial “inspecting a pin” and
our random variable X, “number of successes”, we note two important points:

- The trial “inspecting a pin” has only two possible outcomes, which are mutually exclusive.
Such a trial, whose outcome can only be either a success or a failure, is a Bernoulli trial. In
other words, the sample space of a Bernoulli trial is

S = {success, failure}

- The random variable X that measures the number of successes in one Bernoulli trial is a
Bernoulli random variable. Clearly, X is 1 if the pin is good and 0 if it is defective.

It is easy to derive the probability distribution of a Bernoulli random variable:

X:       0        1
P(X):    1 - p    p

If X is a Bernoulli random variable, we may write

X ~ BER (p)

Where ~ is read as “is distributed as” and BER stands for Bernoulli.

A Bernoulli random variable is too simple to be of immediate practical use. But it forms the
building block of the Binomial random variable, which is quite useful in practice. The
binomial random variable in turn is the basis for many other useful cases, such as Poisson
random variable.

2. Binomial Distribution

In the real world we often make several trials, not just one, to achieve one or more successes.
Let us consider such cases of several trials.

Consider n identically and independently distributed Bernoulli random variables
X1, X2, …, Xn. Here, identically means that they all have the same p, and
independently means that the value of one X does not in any way affect the value of another.
For example, the value of X2 does not affect the value of X3 or X8, and so on. Such a sequence
of identically and independently distributed Bernoulli variables is called a Bernoulli Process.

Suppose an operator produces n pins, one by one, on a lathe that has probability p of making a
good pin at each trial. The sequence of numbers (1 or 0) denoting the good and defective pins
produced in each of the n trials is a Bernoulli process. For example, in the sequence of nine
trials denoted by

001011001

The third, fifth, sixth and ninth are good pins, or successes. The rest are failures. In practice,
we are usually interested in the total number of good pins rather than the sequence of 1's and
0's. In the example above, four out of nine are good. In the general case, let X denote the total
number of good pins produced in n trials. We then have X = X1 + X2 +………+ Xn where all
Xi ~ BER(p) and are independent.

The random variable that counts the number of successes in many independent, identical
Bernoulli trials is called a Binomial Random Variable.

The binomial distribution is used to represent the probability distribution of a discrete random
variable. Binomial means two categories. The successive repetition of an observation (trial)
may result in an outcome which possesses or which does not possess a specified character.
Our primary interest will be in one of these possibilities. Conventionally, the outcome of
primary interest is termed a success. The alternative outcome is termed a failure. These
terms are used irrespective of the nature of the outcome. For example, non-germination
of a seed may be termed a success.

A binomial experiment satisfies the following criteria:

- Each trial has only two outcomes (success or failure), i.e. it is a Bernoulli trial.
- The number of trials n is fixed; n is a positive integer.
- At each trial the probability of success (p) remains the same.
- The n trials are independent.

The variable X which represents the count of the number of successes in Bernoulli trials will
be a discrete random variable. The probability distribution of such discrete random variable X
is called the binomial distribution.

The binomial distribution is given by the probability mass function (pmf)

\[ P[X = x] = \binom{n}{x} p^{x} q^{n-x}, \qquad x = 0, 1, 2, \ldots, n. \]

In the formula,
n = number of trials
x = number of successes in the n trials
n - x = number of failures in the n trials
p = probability of success in a single trial
q = 1 - p = probability of failure in a single trial
$\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$ = the number of ways in which x successes can occur in n trials.

The binomial distribution is determined by two parameters, n and p.

The expected value of the binomial distribution is np and its variance is npq.

Remark

- The mean of the binomial distribution is

\[
\begin{aligned}
E(X) &= \sum_{x=0}^{n} x\,P(X = x) \\
     &= \sum_{x=0}^{n} x \binom{n}{x} p^{x} q^{n-x} \\
     &= \sum_{x=0}^{n} x\,\frac{n!}{x!\,(n-x)!}\,p^{x} q^{n-x} \\
     &= \sum_{x=1}^{n} x\,\frac{n\,(n-1)!}{x\,(x-1)!\,(n-x)!}\,p\,p^{x-1} q^{n-x}
        \qquad [\text{the } x = 0 \text{ term is zero}] \\
     &= np \sum_{x=1}^{n} \frac{(n-1)!}{(x-1)!\,(n-x)!}\,p^{x-1} q^{n-x} \\
     &= np \sum_{x=1}^{n} \binom{n-1}{x-1} p^{x-1} q^{n-x} \\
     &= np\,(q + p)^{n-1} \\
     &= np\,(1)^{n-1} \qquad [\because q + p = 1] \\
     &= np
\end{aligned}
\]

The mean of the binomial distribution is np.

- Variance of the binomial distribution:

The variance of the binomial distribution is

\[ V(X) = E(X^2) - [E(X)]^2 = E(X^2) - (np)^2 \qquad \ldots (1) \qquad [\because E(X) = np] \]

Now,

\[
\begin{aligned}
E(X^2) &= \sum_{x=0}^{n} x^{2} \binom{n}{x} p^{x} q^{n-x} \\
       &= \sum_{x=0}^{n} [x(x-1) + x] \binom{n}{x} p^{x} q^{n-x} \\
       &= \sum_{x=0}^{n} x(x-1)\,\frac{n!}{x!\,(n-x)!}\,p^{x} q^{n-x}
          + \sum_{x=0}^{n} x\,\frac{n!}{x!\,(n-x)!}\,p^{x} q^{n-x} \\
       &= \sum_{x=2}^{n} x(x-1)\,\frac{n(n-1)(n-2)!}{x(x-1)(x-2)!\,(n-x)!}\,p^{2} p^{x-2} q^{n-x} + E(X) \\
       &= n(n-1)p^{2} \sum_{x=2}^{n} \frac{(n-2)!}{(x-2)!\,(n-x)!}\,p^{x-2} q^{n-x} + np \\
       &= n(n-1)p^{2} \sum_{x=2}^{n} \binom{n-2}{x-2} p^{x-2} q^{n-x} + np \\
       &= n(n-1)p^{2}\,(q + p)^{n-2} + np \\
       &= n(n-1)p^{2}\,(1)^{n-2} + np \qquad [\because q + p = 1] \\
       &= n(n-1)p^{2} + np \qquad \ldots (2)
\end{aligned}
\]

Putting (2) in (1) we get

\[
\begin{aligned}
V(X) &= n(n-1)p^{2} + np - (np)^{2} \\
     &= np\,(np - p + 1 - np) \\
     &= np\,(1 - p) \\
     &= npq
\end{aligned}
\]

The variance of the binomial distribution is npq.
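The results E(X) = np and V(X) = npq can be checked numerically by summing over the pmf. The sketch below uses hypothetical values n = 9 and p = 0.3, chosen only for illustration; the function name `binomial_pmf` is also illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P[X = x] = C(n, x) * p**x * q**(n - x), with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 9, 0.3  # hypothetical parameters (an assumption for illustration)
q = 1 - p
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(probs)                                      # should equal 1
mean = sum(x * px for x, px in enumerate(probs))        # should equal n*p
variance = sum((x - mean) ** 2 * px for x, px in enumerate(probs))  # n*p*q

print(abs(total - 1) < 1e-12)             # True
print(abs(mean - n * p) < 1e-12)          # True
print(abs(variance - n * p * q) < 1e-12)  # True
```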

The binomial distribution approaches the normal distribution as the number of trials n becomes
large (n → ∞) for any fixed value of p. A rule of thumb is that for p < 0.5, the normal
approximation is adequate if np > 15. Departures from these conditions result in less
accurate approximations.

When n is very large and p is very small (n → ∞ and p → 0), the binomial distribution
approaches the Poisson distribution.

Example1: A mid-exam contains 10 multiple choice questions, and each question has
four alternatives with exactly one correct answer. Find the probability that a student who
guesses every answer gets

i. exactly 3 questions correct
ii. exactly 8 questions correct
iii. at least 3 questions correct

Solution

Using the binomial distribution we can get the probability values easily. Here n = 10,

p = ¼ (the chance of picking the correct answer from 4 alternatives)

q = 1 - p = 1 - ¼ = ¾

The possible marks for a student from 10 questions are X = 0, 1, 2, 3, . . ., 10.

\[ P(X = x) = \binom{n}{x} p^{x} q^{n-x} \]

i. $P(X = 3) = \binom{10}{3} (0.25)^{3} (0.75)^{7} = 0.250$

ii. $P(X = 8) = \binom{10}{8} (0.25)^{8} (0.75)^{2} = 0.000386$

iii. P(X ≥ x) = 1 - P(X < x). Hence P(X ≥ 3) = 1 - P(X < 3)

= 1 - {P(X = 0) + P(X = 1) + P(X = 2)}

$P(X = 0) = \binom{10}{0} (0.25)^{0} (0.75)^{10} = 0.0563$

$P(X = 1) = \binom{10}{1} (0.25)^{1} (0.75)^{9} = 0.1877$

$P(X = 2) = \binom{10}{2} (0.25)^{2} (0.75)^{8} = 0.2816$

∴ P(X ≥ 3) = 1 - (0.0563 + 0.1877 + 0.2816) = 0.4744

The mean = np = 2.5. The variance = npq = 1.875
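The three answers of this example can be reproduced with Python's standard library. This is a minimal sketch; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with parameters n and p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 10, 0.25  # 10 questions, 1 correct answer out of 4 alternatives

p_exactly_3 = binomial_pmf(3, n, p)
p_exactly_8 = binomial_pmf(8, n, p)
# Complement rule: P(X >= 3) = 1 - [P(X=0) + P(X=1) + P(X=2)]
p_at_least_3 = 1 - sum(binomial_pmf(x, n, p) for x in range(3))

print(round(p_exactly_3, 4))   # 0.2503
print(round(p_exactly_8, 6))   # 0.000386
print(round(p_at_least_3, 4))  # 0.4744
```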

Example 2: Find the probability of getting five heads and seven tails in 12 flips of a balanced coin.

Solution: Given n = 12 trials. Let X be the number of heads. Then p = probability of getting a
head = 1/2, and q = probability of not getting a head = 1/2. Therefore, the probability of
getting x heads in 12 tosses of the coin is

\[ P(X = x) = \binom{12}{x} \left(\frac{1}{2}\right)^{x} \left(\frac{1}{2}\right)^{12-x}
            = \binom{12}{x} \left(\frac{1}{2}\right)^{12} = \frac{\binom{12}{x}}{4096}, \]

and for x = 5,

\[ P(X = 5) = \frac{\binom{12}{5}}{4096} = \frac{792}{4096} \approx 0.1934. \]

Example 3: If the probability is 0.20 that a person traveling on a certain airplane flight will request a
vegetarian lunch, what is the probability that three of 10 people traveling on this flight will
request a vegetarian lunch?

Solution: Let X be the number of vegetarians. Given n = 10, p = 0.20, x = 3; then,

\[ P(X = 3) = \binom{10}{3} (0.2)^{3} (0.8)^{7} \approx 0.201. \]
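Examples 2 and 3 can be verified in the same way. The sketch below uses an exact fraction for the coin example and ordinary floating-point arithmetic for the lunch example; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

# Example 2: five heads in 12 flips of a balanced coin
p_five_heads = Fraction(comb(12, 5), 2**12)
print(p_five_heads, round(float(p_five_heads), 4))  # 99/512 0.1934

# Example 3: three vegetarian lunches among 10 passengers, p = 0.20
p_three_veg = comb(10, 3) * 0.2**3 * 0.8**7
print(round(p_three_veg, 3))  # 0.201
```

The fraction 99/512 is 792/4096 in lowest terms.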

Checklist 2

Put a tick mark (√) for each of the following questions if you can solve the problems, and an
X otherwise.

1. Can you state the assumptions underlying the binomial distribution?


2. Can you write down the mathematical formula of the binomial distribution?

3. Can you compute probabilities of events in a binomial distribution?


4. Can you define and compute probabilities with hyper geometric rule?

Exercise 1

1. The probability that a patient recovers from a rare blood disease is 0.4. If 100 people are
known to have contracted this disease, what is the probability that less than 30 survive?
2. A multiple-choice quiz has 200 questions each with 4 possible answers of which only 1 is the
correct answer. What is the probability that sheer guess-work yields from 25 to 30 correct
answers for 80 of the 200 problems about which the student has no knowledge?
3. A component has a 20% chance of being a dud. If five are selected from a large batch, what is
the probability that more than one is a dud?
4. A company owns 400 laptops. Each laptop has an 8% probability of not working. You
randomly select 20 laptops for your salespeople. (a) What is the likelihood that 5 will be
broken?(b) What is the likelihood that they will all work?
5. A study indicates that 4% of American teenagers have tattoos. You randomly sample 30
teenagers. What is the likelihood that exactly 3 will have a tattoo?
6. An XYZ cell phone is made from 55 components. Each component has a .002 probability of
being defective. What is the probability that an XYZ cell phone will not work perfectly?
7. The ABC Company manufactures toy robots. About 1 toy robot per 100 does not work.
You purchase 35 ABC toy robots. What is the probability that exactly 4 do not work?
8. The LMB Company manufactures tires. They claim that only .007 of LMB tires are
defective. What is the probability of finding 2 defective tires in a random sample of 50 LMB
tires?
9. An HDTV is made from 100 components. Each component has a .005 probability of being
defective. What is the probability that an HDTV will not work perfectly?

3. Poisson distribution

The Poisson distribution was developed by the French mathematician Simeon D. Poisson
(1781-1840). The Poisson distribution is also used to represent the probability distribution of a
discrete random variable. It is employed in describing random events that occur rarely over a
continuum of time or space. The Poisson distribution bears a close similarity to the binomial
distribution. Suppose that we are interested in the number of occurrences of an event E in a
time period of length t. This time period can be split into n equal intervals, each of length t/n.
These n intervals can be treated as the n trials of a Bernoulli process. But there is a
difficulty. Since the event occurs at various points of time, it can occur twice or more in one
of the trials of length t/n.

In the case of the binomial distribution the event is dichotomous, and hence there is no
possibility of such multiple occurrences within a single trial. In order to overcome this
difficulty we make n larger and larger. When n is large, the trials are shorter in terms of
length of time. As a result, the probability of occurrence of the event in a single trial is
smaller. This is equivalent to saying that it is a rare event. The binomial distribution can
still be used to represent the distribution of such random events. However, the computations
become tedious since n is very large. This can be explained by an example.

Suppose that the number of insects caught in a trap is being studied and that the data are
collected on the number of insects caught per hour. Assume that the probability that an insect
will be caught in any single minute is 0.06. Assume further that the events of insects being
trapped are mutually independent and that the probability p = 0.06 remains the same for all the
minutes. We may use the binomial distribution to calculate probabilities for the number of
insects caught per hour by considering each minute as a separate Bernoulli trial. If X is the
number of insects caught in an hour, then we have

\[ P[X = x] = \binom{60}{x} (0.06)^{x} (0.94)^{60-x} \]

Instead of dividing the hour into minutes, seconds may be used as the basic units. Then the
value of p would be reduced to p = 0.06/60 = 0.001. Considering each second as a Bernoulli
trial, we would have n = 60 × 60 = 3600 trials for a period of one hour. The binomial
distribution would now be

\[ P[X = x] = \binom{3600}{x} (0.001)^{x} (0.999)^{3600-x} \]


Thus, when n becomes larger and larger, the computations using the binomial distribution become
tedious. Fortunately, it has been shown by Poisson that the value of
$\binom{n}{x} p^{x} q^{n-x}$ approaches the value of

\[ \frac{(np)^{x} e^{-np}}{x!} \]

when n becomes large and p becomes small in such a way that the equality np = λ is maintained.

The Poisson distribution is given by the pmf

\[ P[X = x] = \frac{e^{-\lambda} \lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots \]

In the formula,

λ = np = the mean number of times an event occurs.

x = the number of times an event occurs.

e = the Napierian base (the base of natural logarithms), equal to 2.7182…

The value of $e^{-\lambda}$ can be obtained directly from mathematical tables. In the case of the
Poisson distribution, the counts of the alternative events, i.e., failures, are not of interest.
This is a contrast between the binomial and Poisson distributions. For the Poisson distribution
all that we need is np, the mean number of successes. We need not know n and p individually.
Thus, the Poisson distribution is determined by the single parameter λ. The special property of
the Poisson distribution is that its mean and variance are both equal to λ, i.e. mean = variance = λ.
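The convergence of the binomial to the Poisson distribution can be seen numerically for the insect-trap example, where np = 60 × 0.06 = 3600 × 0.001 = 3.6. The sketch below compares the two binomial models with the Poisson pmf; the function names and the range of x values compared are illustrative choices.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """P[X = x] = C(n, x) * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """P[X = x] = exp(-lam) * lam**x / x!"""
    return exp(-lam) * lam**x / factorial(x)


lam = 3.6  # np for both the per-minute and the per-second model

for x in range(7):
    print(
        x,
        round(binomial_pmf(x, 60, 0.06), 5),     # minutes as trials
        round(binomial_pmf(x, 3600, 0.001), 5),  # seconds as trials
        round(poisson_pmf(x, lam), 5),           # Poisson limit
    )

# Largest gap between each binomial model and the Poisson pmf
gap_minutes = max(abs(binomial_pmf(x, 60, 0.06) - poisson_pmf(x, lam)) for x in range(10))
gap_seconds = max(abs(binomial_pmf(x, 3600, 0.001) - poisson_pmf(x, lam)) for x in range(10))
print(gap_seconds < gap_minutes)  # True
```

The per-second model (n = 3600, p = 0.001) sits much closer to the Poisson values than the per-minute model, which is the behaviour the limit argument describes.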

Example1: At a parking place the average number of car-arrivals during a specified period of
15 minutes is 2. If the arrival process is well described by a Poisson process, find the
probability that during a given period of 15 minutes

(a) No car will arrive

(b) At least two cars will arrive

(c) At most three cars will arrive

(d) Between 1 and 3 cars will arrive


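Solution: Let X denote the number of car arrivals during the specified period of 15 minutes.
So X has a Poisson distribution with mean λ = 2, and

\[ P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!} = \frac{e^{-2}\,2^{x}}{x!}, \qquad x = 0, 1, 2, \ldots \]

a. P(no car will arrive) = P(X = 0) = $\frac{e^{-2}\,2^{0}}{0!}$

= $e^{-2}$ = 0.1353

b. P(at least two cars will arrive) = P(X ≥ 2)

= 1 - P(X < 2)

= 1 - {P(X = 0) + P(X = 1)}

= 1 - { $\frac{e^{-2}\,2^{0}}{0!} + \frac{e^{-2}\,2^{1}}{1!}$ }

= 1 - {0.1353 + 0.2707}

= 1 - 0.4060

= 0.5940

c. P(at most three cars will arrive) = P(X ≤ 3)

= P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)

= 0.1353 + 0.2707 + 0.2707 + 0.1804

= 0.8571

d. P(between 1 and 3 cars will arrive) = P(1 ≤ X ≤ 3)

= P(X ≤ 3) - P(X = 0)

= 0.8571 - 0.1353

= 0.7218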
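The four probabilities of this example follow directly from the Poisson pmf with λ = 2. A minimal Python sketch, with illustrative names, is:

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x!"""
    return exp(-lam) * lam**x / factorial(x)


lam = 2  # average number of car arrivals in 15 minutes

p_none = poisson_pmf(0, lam)                                  # (a) P(X = 0)
p_at_least_2 = 1 - (poisson_pmf(0, lam) + poisson_pmf(1, lam))  # (b) P(X >= 2)
p_at_most_3 = sum(poisson_pmf(x, lam) for x in range(4))      # (c) P(X <= 3)
p_1_to_3 = p_at_most_3 - p_none                               # (d) P(1 <= X <= 3)

print(f"{p_none:.4f} {p_at_least_2:.4f} {p_at_most_3:.4f} {p_1_to_3:.4f}")
# 0.1353 0.5940 0.8571 0.7218
```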


Example2: In some experiments it was observed that the incidence of stem fly in black gram
was 6 percent. Suppose we examine 50 black gram plants in a field at random. What is the
probability that at most 3 plants will be found to be affected by stem fly?

Solution

The probability that a plant is affected by stem fly is given as p = 0.06.

The number of plants observed is n = 50. Hence, λ = np = 3. The required probability is

P[X ≤ 3] = P[X = 0] + P[X = 1] + P[X = 2] + P[X = 3]

\[ P[X = x] = \frac{e^{-\lambda} \lambda^{x}}{x!} \]

\[ P[X = 0] = \frac{e^{-3}\,3^{0}}{0!} = e^{-3} \]

\[ P[X = 1] = \frac{e^{-3}\,3^{1}}{1!} = 3e^{-3} \]

\[ P[X = 2] = \frac{e^{-3}\,3^{2}}{2!} = 4.5e^{-3} \]

\[ P[X = 3] = \frac{e^{-3}\,3^{3}}{3!} = \frac{27e^{-3}}{6} = 4.5e^{-3} \]

∴ P[X ≤ 3] = 13e^{-3}

From mathematical tables it can be found that e^{-3} = 0.0498.

Therefore P[X ≤ 3] = 13 × 0.0498 = 0.6474.
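Since this example uses the Poisson distribution as an approximation to a binomial with n = 50 and p = 0.06, it is instructive to compute both. The sketch below does so with the standard library; the variable names are illustrative. Note that the unrounded value of 13e^{-3} is 0.6472; the figure 0.6474 in the text arises from rounding e^{-3} to 0.0498 before multiplying.

```python
from math import comb, exp, factorial

n, p = 50, 0.06
lam = n * p  # 3.0

# Poisson approximation: sum of exp(-lam) * lam**x / x! for x = 0..3
poisson_at_most_3 = sum(exp(-lam) * lam**x / factorial(x) for x in range(4))

# Exact binomial probability for comparison
binomial_at_most_3 = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(4))

print(round(poisson_at_most_3, 4))  # 0.6472
print(round(binomial_at_most_3, 4))
```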

Checklist

Put a tick mark (√) for each of the following questions if you can solve the problems, and an
X otherwise.

1. Can you approximate the binomial distribution with Poisson?


2. Can you state the conditions for these approximations?

3. Can you write down the pmf of the Poisson distribution?


4. Can you compute the probabilities related with the Poisson distribution?

4. Hypergeometric Distribution

We are interested in computing probabilities for the number of observations that fall into a
particular category. In the case of the binomial distribution, independence among trials is
required. As a result, if that distribution is applied to, say, sampling from a lot of items (deck
of cards, batch of production items), the sampling must be done with replacement of
each item after it is observed. On the other hand, the hypergeometric distribution does not
require independence and is based on sampling done without replacement.

Applications for the hypergeometric distribution are found in many areas, with heavy use in
acceptance sampling, electronic testing, and quality assurance. Obviously, in many of these
fields, testing is done at the expense of the item being tested. That is, the item is destroyed and
hence cannot be replaced in the sample. Thus, sampling without replacement is necessary.

In general, we are interested in the probability of selecting x successes from the M items
labeled successes and n − x failures from the N − M items labeled failures when a random
sample of size n is selected from N items. This is known as a hypergeometric experiment,
that is, one that possesses the following two properties: a random sample of size n is selected
without replacement from N items; and of the N items, M may be classified as successes and
N − M are classified as failures. The number X of successes of a hypergeometric experiment is
called a hypergeometric random variable.

Definition: The probability distribution of the hypergeometric random variable X, the number
of successes in a random sample of size n selected from N items of which M are labeled
success and N − M labeled failure, is

\[ P(X = x) = \frac{\binom{M}{x} \binom{N-M}{n-x}}{\binom{N}{n}} \]

for x = 0, 1, 2, . . ., n; x ≤ M, n − x ≤ N − M.

The range of x can be determined by the three binomial coefficients in the definition, where x
and n − x are no more than M and N − M, respectively, and neither of them can be less than 0.


Usually, when both M (the number of successes) and N − M (the number of failures) are larger
than the sample size n, the range of a hypergeometric random variable will be x = 0, 1, . . ., n.

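Let X be a hypergeometric random variable for a sample of size n selected from N items, of which M are
labeled success. Then:

Property 1: Mean: E(X) = µ = \( \frac{nM}{N} \)

Property 2: Variance: Var(X) = E[(X − E(X))²] = \( n \cdot \frac{M}{N}\left(1 - \frac{M}{N}\right)\frac{N-n}{N-1} \)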

Property 3: Moment Generating Function: Not Given
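The mean and variance of a hypergeometric random variable can be checked numerically by summing over the pmf. The sketch below is only an illustration: the function names and the values N = 50, M = 20, n = 10 are assumptions chosen for the check, and the closed forms used for comparison are the standard ones, nM/N and n(M/N)(1 − M/N)(N − n)/(N − 1).

```python
from math import comb

def hypergeom_pmf(x, N, n, M):
    """P(X = x): x successes in a sample of n drawn without replacement
    from N items, M of which are labeled success."""
    if x < 0 or x > M or n - x < 0 or n - x > N - M:
        return 0.0
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

def hypergeom_mean_var(N, n, M):
    """Mean and variance obtained by direct summation over the pmf."""
    xs = range(n + 1)
    mean = sum(x * hypergeom_pmf(x, N, n, M) for x in xs)
    var = sum((x - mean) ** 2 * hypergeom_pmf(x, N, n, M) for x in xs)
    return mean, var

# Illustrative values (assumed, not taken from the module).
N, n, M = 50, 10, 20
mean, var = hypergeom_mean_var(N, n, M)

# Standard closed-form expressions for comparison.
closed_mean = n * M / N
closed_var = n * (M / N) * (1 - M / N) * (N - n) / (N - 1)

print(mean, closed_mean)
print(var, closed_var)
```

The two printed pairs should agree up to floating-point rounding, which is a quick way to catch a mistyped formula.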

Remark

When the lot size N is large compared with the sample size n, the hypergeometric probability mass
function is well approximated by the probability mass function of a binomial random variable with p = M/N.
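One way to see this remark in action is to hold the sample size and the proportion of successes fixed while the lot grows. The following sketch is an illustration under assumed values (n = 5, one success in every ten items, and lots of 50, 500 and 5000 items); the helper names are hypothetical.

```python
from math import comb

def hypergeom_pmf(x, N, n, M):
    """Hypergeometric pmf: sampling without replacement."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

def binom_pmf(x, n, p):
    """Binomial pmf: sampling with replacement (independent trials)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def max_gap(N, n=5):
    """Largest pointwise difference between the two pmfs when
    one item in ten is a success (M = N // 10, p = 0.1)."""
    M = N // 10
    return max(abs(hypergeom_pmf(x, N, n, M) - binom_pmf(x, n, 0.1))
               for x in range(n + 1))

# The gap shrinks as the lot size grows relative to the sample size.
for N in (50, 500, 5000):
    print(N, round(max_gap(N), 5))
```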

Example: Lots of 40 components each are deemed unacceptable if they contain 3 or more
defectives. The procedure for sampling a lot is to select 5 components at random and to reject
the lot if a defective is found. What is the probability that exactly 1 defective is found in the
sample if there are 3 defectives in the entire lot?

Solution: Using the hypergeometric distribution with n = 5, N = 40, M = 3, and x = 1, we find the
probability of obtaining 1 defective to be

\[ p(1; 40, 5, 3) = \frac{\binom{3}{1}\binom{37}{4}}{\binom{40}{5}} = 0.3011. \]

Once again, this plan is not desirable since it detects a bad lot (3 defectives) only about 30% of the time.
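The figure in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function name is an assumption, and the second quantity (the chance of finding at least one defective) is added only for comparison.

```python
from math import comb

def hypergeom_pmf(x, N, n, M):
    """P(X = x) for a sample of n taken without replacement from N items,
    M of which are defective."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Lot of 40 components, 3 defective, sample of 5.
p_one = hypergeom_pmf(1, N=40, n=5, M=3)

# Probability that the sample contains at least one defective.
p_at_least_one = 1 - hypergeom_pmf(0, N=40, n=5, M=3)

print(round(p_one, 4))  # 0.3011
print(round(p_at_least_one, 4))
```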

Example 1: Two balls are selected at random, in succession and without replacement, from a bag containing 5 blue and 3
green balls. Find the pmf of the number of blue balls selected.

Solution: Let X be the number of blue balls selected (successes). We are given a = 5 (blue balls), b = 3
(green balls), and n = 2. Then the pmf of X is

\[ P(X = x) = f(x) = \frac{\binom{5}{x}\binom{3}{2-x}}{\binom{8}{2}}, \quad x = 0, 1, 2, \]

so that f(0) = 3/28, f(1) = 15/28, and f(2) = 10/28.
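The pmf of the number of blue balls can also be tabulated with exact fractions, which avoids rounding. The sketch below assumes a helper name, blue_pmf, that is not part of the module; note that Python reduces 10/28 to 5/14 automatically.

```python
from fractions import Fraction
from math import comb

def blue_pmf(x, blue=5, green=3, n=2):
    """Exact probability of x blue balls when n balls are drawn
    without replacement from a bag of blue and green balls."""
    return Fraction(comb(blue, x) * comb(green, n - x), comb(blue + green, n))

pmf = {x: blue_pmf(x) for x in range(3)}
for x, p in pmf.items():
    print(x, p)

# A valid pmf must sum to 1.
print(sum(pmf.values()))
```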


ACTIVITY:

An urn contains 8 blue balls and 12 white balls. If five balls are drawn at random without
replacement, what is the probability that the sample will contain two blue and three white?

Among 16 applicants for a job, 10 have college degrees. If three of the applicants are
randomly chosen for interviews, what are the probabilities that: (a) none has college degrees;
(b) two have college degrees; (c) one has a college degree; (d) all three have college degrees?

Summary

➢ The binomial pmf is given by: \( P(X = x) = \binom{n}{x} p^x q^{n-x} \), x = 0, 1, 2, …, n.
➢ For a binomial random variable, E(X) = np and V(X) = npq.
➢ The pmf of a negative binomial distribution is given by: \( NB(x; k, p) = \binom{x-1}{k-1} p^k q^{x-k} \), for x = k, k + 1, k + 2, ….
➢ The pmf of a geometric distribution is: \( G(x; p) = p q^{x-1} \), for x = 1, 2, 3, ….
➢ When n is large and p is very small, the binomial is approximated by the Poisson distribution as: \( P(X = x) \approx \frac{(np)^x e^{-np}}{x!} \), for x = 0, 1, 2, ….
➢ The Poisson distribution is used to model rare events and its pmf is given by: \( P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!} \), x = 0, 1, 2, …, where λ is the average number of successes.
➢ Both the mean and the variance of a Poisson distribution are equal to λ.
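The Poisson approximation to the binomial mentioned in this summary can be checked side by side. The sketch below uses assumed illustrative values, n = 500 and p = 0.01 (so np = 5), and hypothetical helper names; it is not a worked solution to any exercise in the module.

```python
from math import comb, exp, factorial

def binom_pmf(x, n, p):
    """Binomial pmf with q = 1 - p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """Poisson pmf with mean lam."""
    return lam**x * exp(-lam) / factorial(x)

# Assumed illustrative values: large n, small p, lam = n * p = 5.
n, p = 500, 0.01
lam = n * p

for x in range(8):
    print(x, round(binom_pmf(x, n, p), 4), round(poisson_pmf(x, lam), 4))

# Largest pointwise gap over the values that carry almost all the probability.
largest_gap = max(abs(binom_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(16))
print(largest_gap)
```

The two columns agree to about two or three decimal places here; the agreement improves as n grows and p shrinks with np held fixed.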


4.11. Common Continuous Probability Distributions

In the first case, the binomial random variable X1 could take only a finite number of integer
values: 0, 1, 2, …, n; in the second case, the Poisson random variable X2 could take an
infinite number of integer values: 0, 1, 2, 3, …. The random variables X1 and X2 are
discrete, in the sense that their values can be listed in a sequence, finite or infinite. In contrast to
these, let us consider a situation where the variable of interest may take any value within a
given range.

Suppose we are planning to measure the variability of an automatic bottling process that
fills ½-liter (500 cm³) bottles with cola. The variable, say X, indicating the deviation of the
actual volume from the normal (average) volume can take any real value: positive or
negative, integer or decimal. This type of random variable, which can take any value
in a given range, is called a continuous random variable, and the probability
distribution of such a variable is called a continuous probability distribution. The concepts
and assumptions inherent in the treatment of such distributions are quite different from those
used in the context of a discrete distribution. In the present lesson, after understanding the
basic concepts of continuous distributions, we will discuss the Uniform, Normal and Exponential
distributions, three important continuous distributions that are applicable to many real-life
processes.

1. Uniform Distribution

One of the simplest continuous distributions in all of statistics is the continuous uniform
distribution. This distribution is characterized by a density function that is “flat,” and thus
the probability is uniform in a closed interval, say [a, b]. Suppose you were to randomly select
a number X represented by a point in the interval [a, b]. The density function of X is
constant over this interval and zero outside it.

Note that the density function forms a rectangle with base b − a and constant height 1/(b − a) to
ensure that the area under the rectangle equals one. As a result, the uniform distribution is
often called the rectangular distribution.

Definition 9.1:

A random variable with the flat density described above is called a uniform random variable.
The probability density function of a uniform random variable X with
parameters a and b is given by:

\[ f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & \text{elsewhere} \end{cases} \]

Property 1: Mean: E(X) = µ = \( \frac{a+b}{2} \)

Property 2: Variance: Var(X) = E[(X − E(X))²] = \( \frac{(b-a)^2}{12} \)

Property 3: Moment Generating Function:

\( M_X(t) = E(e^{tX}) = \frac{e^{tb} - e^{ta}}{t(b-a)} \) for t ≠ 0
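Because the uniform density is a rectangle, probabilities reduce to ratios of lengths. The sketch below illustrates this with assumed endpoints a = 2 and b = 10 and hypothetical function names; the mean and variance use the standard formulas (a + b)/2 and (b − a)²/12.

```python
def uniform_cdf(x, a, b):
    """P(X <= x) for X uniform on [a, b]: the fraction of the
    interval that lies to the left of x."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def uniform_mean_var(a, b):
    """Standard closed forms for the mean and variance."""
    return (a + b) / 2, (b - a) ** 2 / 12

# Assumed illustrative endpoints (not from the module).
a, b = 2.0, 10.0
mean, var = uniform_mean_var(a, b)

# P(4 < X < 7) is the length of (4, 7) divided by the length of (2, 10).
p = uniform_cdf(7.0, a, b) - uniform_cdf(4.0, a, b)

print(mean, var, p)
```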

Example 9.1: The department of transportation has determined that the winning (low) bid X (in
dollars) on a road construction contract has a uniform distribution with probability density
function f(x) = , if < x< 2d, where d is the department of transportation estimate of the

cost of job. (a) Find the mean and SD of X. (b) What fraction of the winning bids on road
construction contracts are greater than the department of transportation estimate?

Solution: (a) E(X) = = (2d- 2d/2)/2 = d/2

V(X) = E(X – E(x))2 = = d2/12

(b) p(X > d) = = [x = (2d - d) =


Activity

Suppose the research department of a steel manufacturer believes that one of the company’s
rolling machines is producing sheets of steel of varying thickness. The thickness X is a
uniform random variable with values between 150 and 200 millimeters. Any sheets less than 160
millimeters thick must be scrapped, since they are unacceptable to buyers. (a) Calculate the
mean and variance of X. (b) Find the fraction of steel sheets produced by this machine that
have to be scrapped.

2. Normal Distribution

The normal distribution is a continuous, symmetric, bell-shaped distribution of a variable.

The Normal Distribution is the most versatile of all the continuous probability distributions. It
is widely used in data-based research in agriculture, trade, business and
industry. It is useful in characterizing uncertainties in many real-life processes, in
statistical inference, and in approximating other probability distributions. A large number of
random variables occurring in practice can be approximated by the normal distribution.

A random variable that is affected by many independent causes, where the effect of each
cause is not overwhelmingly large compared to the other effects, closely follows a normal
distribution.

The lengths of pins made by an automatic machine; the times taken by an assembly worker to
complete the assigned task repeatedly; the weights of baseballs; the tensile strengths of a
batch of bolts; and the volumes of cola in a particular brand of canned cola - are good
examples of normally distributed random variables. All of these are affected by several
independent causes where the effect of each cause is small. This knowledge helps us in
calculating the probabilities of different events in varied situations, which in turn is useful for
decision-making.

In many real life situations, we face the problem of making statistical inferences about
processes based on limited data. Limited data is basically a sample from the full body of data
on the process. Irrespective of how the full body of data is distributed, it has been found that
the Normal Distribution can be used to characterize the sampling distribution of many of the
sample statistics. This helps considerably in Statistical Inferences.

Finally, the Normal Distribution can be used to approximate certain probability distributions.
This helps considerably in simplifying the probability calculations.

Properties of normal distribution

1. A normal distribution curve is bell shaped.


2. The mean, median, and mode are equal and are located at the center of the distribution.
3. A normal distribution curve is unimodal (i.e. it has only one mode).
4. The curve is symmetric about the mean, which is equivalent to saying that its shape is the
same on both sides of a vertical line passing through the center.
5. The curve is continuous, that is, there are no gaps or holes. For each value of x, there is a
corresponding value of y.
6. The curve never touches the x axis. Theoretically, no matter how far in either direction the
curve extends, it never meets the x axis, but it gets increasingly closer.
7. The total area under a normal distribution curve is equal to 1.00 or 100%. This fact may seem
unusual, since the curve never touches the x axis, but one can prove it mathematically by
using calculus.
8. The area under the part of a normal curve that lies within 1 standard deviation of the mean is approximately
0.68, or 68%; within 2 standard deviations, about 0.95, or 95%; and within 3 standard deviations, about 0.997, or 99.7%.
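The areas quoted in property 8 do not depend on the particular mean or standard deviation. As a rough check, the sketch below uses the error function from Python's math module; the identity P(|X − μ| < kσ) = erf(k/√2) is a standard result, and the function name is an assumption.

```python
from math import erf, sqrt

def within_k_sd(k):
    """Area under any normal curve within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))
```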

The probability density function of a normal distribution with mean μ and variance σ² is given by

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty,\; -\infty < \mu < \infty,\; 0 < \sigma < \infty. \]

The probability that a normally distributed value x lies between two numbers a and b is given by

\[ P(a < x < b) = \int_a^b f(x)\,dx. \]

This definite integral has no closed form and is tedious to compute. To overcome this problem, we
standardize the value and use the table of the standard normal distribution to compute the
probabilities.
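To see that standardizing gives the same answer as integrating the density directly, the two routes can be compared numerically. The sketch below is an illustration under assumed values (μ = 100, σ = 15, a = 85, b = 130); the midpoint rule stands in for the integral and the error function stands in for the printed table.

```python
from math import erf, exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def prob_by_integration(a, b, mu, sigma, steps=10_000):
    """Midpoint-rule approximation of the area under the density from a to b."""
    h = (b - a) / steps
    return sum(normal_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(steps)) * h

def phi(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def prob_by_standardizing(a, b, mu, sigma):
    """Convert a and b to z-scores, then take a difference of areas."""
    return phi((b - mu) / sigma) - phi((a - mu) / sigma)

# Assumed illustrative values (not from the module).
mu, sigma, a, b = 100.0, 15.0, 85.0, 130.0
direct = prob_by_integration(a, b, mu, sigma)
standardized = prob_by_standardizing(a, b, mu, sigma)
print(round(direct, 4), round(standardized, 4))
```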


Example: A cost accountant needs to forecast the unit cost of a product for the next year. He
notes that each unit of the product requires 10 labor hours and 5 kg of raw material. In
addition, each unit of the product is assigned an overhead cost of Rs 200. He estimates that
the cost of a labor hour next year will be normally distributed with an expected value of Rs 45
and a standard deviation of Rs 2; the cost of raw material will be normally distributed with an
expected value of Rs 60 and a standard deviation of Rs 3. Find the distribution of the unit cost
of the product. Find its expected value and variance.

Solution: Since the cost of labor L may not influence the cost of raw material M, we can
assume that the two are independent. The unit cost of the product, Q, is then a random
variable. So if

L ~ N(45, 2²) and M ~ N(60, 3²)

then Q = 10L + 5M + 200 will follow a normal distribution with

Mean = E(Q) = 10E(L) + 5E(M) + 200

= 10(45) + 5(60) + 200

= 950

Variance = V(Q) = 10²V(L) + 5²V(M)

= 100(4) + 25(9)

= 625

So Q ~ N(950, 25²)
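The rule used in this solution, that a linear combination of independent normal variables has mean equal to the same combination of the means and variance equal to the squared-coefficient combination of the variances, can be written as a small helper. The function name and argument layout below are assumptions made for illustration.

```python
from math import sqrt

def linear_combo_normal(coeffs, means, variances, constant=0):
    """Mean and variance of constant + sum(c_i * X_i) for independent X_i.
    Coefficients are squared in the variance; the constant does not affect it."""
    mean = constant + sum(c * m for c, m in zip(coeffs, means))
    var = sum(c**2 * v for c, v in zip(coeffs, variances))
    return mean, var

# Q = 10L + 5M + 200 with L ~ N(45, 2^2) and M ~ N(60, 3^2).
mean_q, var_q = linear_combo_normal([10, 5], [45, 60], [2**2, 3**2], constant=200)
print(mean_q, var_q, sqrt(var_q))  # 950 625 25.0
```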

Some important area relationships under the normal curve are


Area between μ - 1σ and μ + 1σ is about 0.6826
Area between μ - 2σ and μ + 2σ is about 0.9544
Area between μ - 3σ and μ + 3σ is about 0.9974
Area between μ – 1.96σ and μ + 1.96σ is 0.95
Area between μ – 2.58σ and μ + 2.58σ is 0.99

The standard normal distribution is a normal distribution with a mean of 0 (zero) and a
standard deviation of 1.

The formula for the standard normal distribution is

\[ f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty. \]

All normally distributed variables can be transformed into the standard normally distributed
variable by using the formula for the standard score:

\[ Z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \quad \text{or} \quad Z = \frac{X - \mu}{\sigma} \]

The probability that a value x lies between two values a and b is given by the corresponding area under the
standard normal distribution curve.

Procedure to find the area under the standard normal distribution curve

1. Between 0 and any Z value: look up the Z value in the table to get the area.

2. In any tail:

a. Look up the Z value in the table to get the area.

b. Subtract the area from 0.5.

3. Between two Z values on the same side of the mean:

a. Look up both Z values to get the areas.

b. Subtract the smaller area from the larger area.

4. Between two Z values on opposite sides of the mean:

a. Look up both Z values to get the areas.

b. Add the areas.

5. To the left of any Z value, where Z is greater than the mean:

a. Look up the Z value to get the area.

b. Add 0.5 to the area.

6. To the right of any Z value, where Z is less than the mean:

a. Look up the Z value to get the area.

b. Add 0.5 to the area.

7. In any two tails:

a. Look up both Z values in the table to get the areas.

b. Subtract each area from 0.5.

c. Add the answers.

Procedure

1. Draw the picture.


2. Shade the area desired.
3. Find the correct figure.
4. Follow the direction.

Example 1: Find the area under the standard normal distribution which lies.

a. between Z=0 & Z=0.96

P (0<Z<0.96) =?

P (0<Z<0.96) =0.3315

b. Between Z= -1.45 & Z=0

P (-1.45<Z<0) =?

P (-1.45<Z<0) = P (0<Z<1.45), by symmetry

=0.4265

c. The area to the right of Z = -0.35

P (Z>-0.35) = P (-0.35<Z<0) + P(Z>0)

= P (0<Z<0.35) + 0.5

= 0.1368 + 0.5

= 0.6368

d. To the left of Z= -0.35

P (Z<-0.35) = 1-P (Z≥-0.35)

= 1- 0.6368

= 0.3632

Example 2: Find the area under the standard normal curve which lies

a. Between Z=-0.67 & Z=0.75

P (-0.67<Z<0.75) =?

=P (-0.67<Z<0) + P (0<Z<0.75)

=P (0<Z<0.67) + P (0<Z<0.75), since P (-0.67<Z<0) = P (0<Z<0.67) by symmetry

=0.2486 + 0.2734

=0.5220

b. Between Z=2.13 and Z=2.94

P (2.13<Z<2.94) =?

=P (0<Z<2.94) - P (0<Z<2.13)

=0.4984 - 0.4834 = 0.0150
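The table look-ups in Examples 1 and 2 can be cross-checked in code. The sketch below mimics the printed table, which lists the area between 0 and a positive z, and builds the other cases from it. The function names are assumptions, and because the table is rounded to four decimals, results may differ from the hand calculations in the last digit.

```python
from math import erf, sqrt

def table_area(z):
    """Area between 0 and |z| under the standard normal curve,
    i.e. the quantity a standard normal table lists."""
    return 0.5 * erf(abs(z) / sqrt(2))

def area_left_of(z):
    """P(Z < z): add 0.5 when z is right of the mean, subtract from 0.5 otherwise."""
    return 0.5 + table_area(z) if z >= 0 else 0.5 - table_area(z)

def area_right_of(z):
    """P(Z > z), the complement of the left area."""
    return 1.0 - area_left_of(z)

def area_between(z1, z2):
    """P(z1 < Z < z2) for either ordering of z1 and z2."""
    return abs(area_left_of(z2) - area_left_of(z1))

def two_tail_area(z):
    """Combined area in both tails beyond -|z| and +|z|."""
    return 2 * (0.5 - table_area(z))

print(round(table_area(0.96), 4))
print(round(area_right_of(-0.35), 4))
print(round(area_between(-0.67, 0.75), 4))
print(round(area_between(2.13, 2.94), 4))
print(round(two_tail_area(1.96), 4))
```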

Example 3: Find z if

a) The normal curve area between 0 and z (positive) is 0.4726.

P (0<Z<z) = 0.4726

z=? z=1.92

b) the normal curve area to the left of z is equal to 0.9868

P (Z<z) = P (Z<0) + P (0<Z<z) = 0.9868

0.5 + P (0<Z<z) = 0.9868

P (0<Z<z) = 0.9868 - 0.5 = 0.4868

z = 2.22 from the normal table
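Finding z from a given area is the reverse of a table look-up. A minimal sketch using the inverse cumulative distribution function from Python's statistics module (available from Python 3.8) is shown below; the helper names are assumptions.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def z_for_area_from_zero(area):
    """Positive z such that P(0 < Z < z) equals the given area."""
    return std_normal.inv_cdf(0.5 + area)

def z_for_left_area(area):
    """z such that P(Z < z) equals the given area."""
    return std_normal.inv_cdf(area)

print(round(z_for_area_from_zero(0.4726), 2))  # 1.92
print(round(z_for_left_area(0.9868), 2))       # 2.22
```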

The transformation of normal random variables

The importance of the standard normal distribution derives from the fact that
any normal random variable may be transformed to the standard normal
random variable. If we want to transform X, where X ~ N(μ, σ²), into the
standard normal random variable Z ~ N(0, 1²), we can do this as follows:

\[ Z = \frac{X - \mu}{\sigma} \]

We move the distribution from its center of μ to a center of 0. This is done by
subtracting μ from all the values of X. Thus, we shift the distribution μ units
back so that its new center is 0. To make the standard deviation of the
distribution equal to 1, we divide the random variable by its standard deviation
σ. The area under the curve adjusts so that the total remains the same. All
probabilities (areas under the curve) adjust accordingly. Thus, the
transformation from X to Z is achieved by first subtracting μ from X and then
dividing the result by σ.

Note: The table gives the areas between 0 and any z value to the right of 0, and
all areas are positive. The value of Z is calculated using

\[ Z = \frac{X - \mu}{\sigma}, \quad \text{i.e. } Z \sim N(0, 1). \]

Given a normally distributed random variable X with mean μ and standard deviation σ, the
probability that X lies between two values a and b is given by

\[ P(a < X < b) = P\left(\frac{a-\mu}{\sigma} < \frac{X-\mu}{\sigma} < \frac{b-\mu}{\sigma}\right) = P\left(\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right) \]


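Example: If X ~ N(50, 10²), find the probability that the value of the random
variable X will be greater than 60.

Solution:

\[ P(X > 60) = P\left(\frac{X-\mu}{\sigma} > \frac{60-\mu}{\sigma}\right) = P\left(\frac{X-50}{10} > \frac{60-50}{10}\right) \]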

= P( Z >1)

= P( Z >0) - P(0 < Z <1)

= 0.5000 - 0.3413

= 0.1587

Example: The weekly wages of 2000 workmen are normally distributed with a
mean wage of Rs 70 and a standard deviation of Rs 5. Estimate the number
of workers whose weekly wages are

a. between Rs 70 and Rs 71
b. between Rs 69 and Rs 73
c. more than Rs 72
d. less than Rs 65

Solution: Let X be the weekly wage in Rs, then

X ~ N(70, 5²)

a. The required probability to be calculated is P(70 < X < 71)

\[ P(70 < X < 71) = P\left(\frac{70-70}{5} < \frac{X-70}{5} < \frac{71-70}{5}\right) \]

= P(0 < Z < 0.2)
= 0.0793
So the number of workers whose weekly wages are between Rs 70 and Rs 71
= 2000 × 0.0793
≈ 159
(b) The required probability to be calculated is P(69 < X < 73)

\[ P(69 < X < 73) = P\left(\frac{69-70}{5} < \frac{X-70}{5} < \frac{73-70}{5}\right) \]

= P(-0.2 < Z < 0.6)
= P(-0.2 < Z < 0) + P(0 < Z < 0.6)
= P(0 < Z < 0.2) + P(0 < Z < 0.6)
= 0.0793 + 0.2257
= 0.3050
So the number of workers whose weekly wages are between Rs 69 and Rs 73
= 2000 × 0.3050
= 610
(c) The required probability to be calculated is P(X > 72)

\[ P(X > 72) = P\left(\frac{X-70}{5} > \frac{72-70}{5}\right) \]

= P(Z > 0.4)
= 0.5 - P(0 < Z < 0.4)
= 0.5 - 0.1554
= 0.3446
So the number of workers whose weekly wages are more than Rs 72
= 2000 × 0.3446
≈ 689
(d) The required probability to be calculated is P(X < 65)

\[ P(X < 65) = P\left(\frac{X-70}{5} < \frac{65-70}{5}\right) \]

= P(Z < -1.0)
= P(Z > 1.0)
= P(Z > 0) - P(0 < Z < 1.0)
= 0.5 - 0.3413
= 0.1587
So the number of workers whose weekly wages are less than Rs 65
= 2000 × 0.1587
≈ 317
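The four parts of this example follow the same pattern: standardize the wage limits, find the area, and multiply by the number of workers. The sketch below wraps that pattern in a hypothetical helper, expected_count, where None stands for an open-ended limit; it uses the error function in place of the printed table.

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), via the standardized value."""
    z = (x - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))

def expected_count(total, lo, hi, mu, sigma):
    """Expected number out of `total` falling between lo and hi.
    Pass None for lo or hi to leave that side open."""
    lower = 0.0 if lo is None else normal_cdf(lo, mu, sigma)
    upper = 1.0 if hi is None else normal_cdf(hi, mu, sigma)
    return total * (upper - lower)

workers, mu, sigma = 2000, 70, 5
count_a = round(expected_count(workers, 70, 71, mu, sigma))    # between Rs 70 and Rs 71
count_b = round(expected_count(workers, 69, 73, mu, sigma))    # between Rs 69 and Rs 73
count_c = round(expected_count(workers, 72, None, mu, sigma))  # more than Rs 72
count_d = round(expected_count(workers, None, 65, mu, sigma))  # less than Rs 65
print(count_a, count_b, count_c, count_d)
```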

3. Exponential Distribution

The exponential distribution is an important density function that is employed as a
model for the relative frequency distribution of the length of time between
random arrivals at a service counter when the probability of a customer arrival
in any one unit of time is equal to the probability of arrival during any other. It
is also used as a model for the length of life of industrial equipment or products
when the probability that an “old” component will operate at least t additional
time units, given it is now functioning, is the same as the probability that a
“new” component will operate at least t time units. Equipment subject to
periodic maintenance and parts replacement often exhibits this property of
“never growing old”.

The exponential distribution is related to the Poisson probability distribution.
In fact, it can be shown that if the number of arrivals at a service counter
follows a Poisson probability distribution with the mean number of arrivals per
unit of time equal to 1/β, then the length of time between successive arrivals
follows an exponential distribution with mean β.


Definition 9.4:

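The continuous random variable X has an exponential distribution, with
parameter β, if its density function is given by:

\[ f(x) = \frac{1}{\beta}\, e^{-x/\beta}, \quad x > 0,\; \beta > 0, \]

and f(x) = 0 elsewhere.

Property 1: Mean: E(X) = µ = \( \int_0^\infty x f(x)\,dx = \beta \)

Property 2: Variance: Var(X) = E[(X − E(X))²] = E(X²) − [E(X)]² = β²

Property 3: Moment Generating Function:

\( M_X(t) = E(e^{tX}) = \frac{1}{1 - \beta t} \) for t < 1/β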

Remark

 A key property possessed only by exponential random variables is that they are
memoryless, in the sense that, for positive s and t, P{X >s + t|X >t} = P{X >s}.
If X represents the life of an item, then the memoryless property states that, for
any t, the remaining life of a t-year-old item has the same probability
distribution as the life of a new item. Thus, one need not remember the age of
an item to know its distribution of remaining life.
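The memoryless property can be verified numerically from the survival function. The sketch below assumes the exponential distribution is parametrized by its mean β, so that P(X > x) = e^(−x/β); the values β = 2, s = 1.5 and t = 4 are illustrative assumptions, as are the function names.

```python
from math import exp

def exp_survival(x, beta):
    """P(X > x) for an exponential random variable with mean beta (assumed
    parametrization: density (1/beta) * exp(-x/beta) for x > 0)."""
    return exp(-x / beta)

def conditional_survival(s, t, beta):
    """P(X > s + t | X > t), computed from the definition of
    conditional probability."""
    return exp_survival(s + t, beta) / exp_survival(t, beta)

# Assumed illustrative values.
beta, s, t = 2.0, 1.5, 4.0
lhs = conditional_survival(s, t, beta)
rhs = exp_survival(s, beta)
print(lhs, rhs)
```

Both numbers agree, whatever value of t is chosen: an item that has already survived t time units has the same chance of lasting s more as a new item has of lasting s.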

Example 9.7: Let X be an exponential random variable with pdf \( f(x) = \frac{1}{2} e^{-x/2} \), x > 0. Find the mean and variance of the random variable X.

Solution: Here β = 2, so E(X) = µ = β = 2 and Var(X) = E[(X − E(X))²] = β² = 4.

Example 9.8: The probability density of X is \( f(x) = 3e^{-3x} \), x > 0. What
is the mean and variance of this pdf?

Solution: This distribution is exponential, and its mean and variance are obtained
as follows: \( E(X) = \int_0^\infty x \cdot 3e^{-3x}\,dx = 1/3 \) and \( V(X) = E(X^2) - [E(X)]^2 = 2/9 - 1/9 = 1/9 \).

ACTIVITY 9.3:

Assume X has an exponential distribution with parameter of and pdf


f(x)= , x > 0. Then find the value of as f(x) is pdf and identify the
value of x if P(X ) = ½.

SUMMARY

➢ A random variable X is said to be uniform over the interval (a, b) if its
probability density function is given by: \( f(x) = \frac{1}{b-a} \), a < x < b, and 0
elsewhere. Its expected value and variance are \( \frac{a+b}{2} \) and \( \frac{(b-a)^2}{12} \), respectively.

➢ A random variable X is said to be normal with parameters μ and σ² if its
probability density function is given by:
\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty;\; \sigma > 0 \text{ and } -\infty < \mu < \infty. \]
The parameters μ and σ² are its expected value and variance.

➢ If X is normal with mean μ and variance σ², then Z, defined by \( Z = \frac{X-\mu}{\sigma} \), is
normal with mean 0 and variance 1. Such a random variable is said to be a
standard normal random variable.

➢ A random variable whose probability density function is of the form \( f(x) = \lambda e^{-\lambda x} \), x ≥ 0, is said to be an exponential random variable with
parameter λ. Its expected value and variance are, respectively, \( \frac{1}{\lambda} \) and \( \frac{1}{\lambda^2} \).


SELF-ASSESSMENT QUESTIONS

1. Explain what you understand by random experiment and a random variable.

Briefly explain the following:

a. Discrete and continuous random variables

b. Discrete probability distribution.

2. “Binomial random variable measures the number of successes in a Bernoulli
Process”.

Explain this statement. Also develop and generalize the Binomial probability rule
with the help of an example.

3. State the important properties of a Binomial distribution. Give examples of
some of the important areas where the Binomial distribution is used.

4. Under what conditions can the Poisson distribution approximate the Binomial
distribution?

Develop the Poisson probability rule from the Binomial probability rule under
these conditions.

5. List some of the important areas where the Poisson distribution is used. Also
state the important properties of a Poisson distribution.

6. On average, a machine produces 20% defective items. Find the probability
that a random sample of 4 items contains

(a) None to four defective items (b) at least 3 defective items

(c) At most 2 defective items.

Out of 200 samples of 4 items, find the expected number of samples with (a),
(b), and (c) above.

7. If the sum of the mean and variance of a binomial distribution of 5 trials is 9/5,
find the binomial distribution.

8. The mean and variance of a binomial distribution are 2 and 1.5 respectively.
Find the probability of

(a) 2 successes (b) at least 2 successes (c) at most 2 successes.

9. 150 random samples of 4 units each are inspected for the number of defective
items. The results are:

Number of defective items:   0   1   2   3   4
Number of samples:          28  62  46  10   4

Fit a binomial distribution to the observed data.

10. The probability that a particular injection will cause a reaction in an
individual is 0.002. Find the probability that out of 1000 individuals (a) none, (b) 1, (c) at least
1, and (d) at most 2 individuals will have a reaction from the injection.

11. In a razor blades manufacturing factory, there is small chance of 1/500 for
any blade to be defective. The blades are supplied in packets of 10. Find the
approximate number of packets containing (a) no, (b) 1, and (c) 2 defective
blades in a consignment of 10,000 packets.

12. The distribution of typing mistakes committed by a typist is given below:

Number of mistakes (X):    0    1    2    3    4    5
Number of pages (f):     142  156   69   27    5    1

Fit a Poisson distribution and find the expected frequencies.

EXERCISE

1. a. If A is any event in the sample space S, prove that
i. P(A/S) = P(A) ii. P(S/A) = 1
b. Let A and B be two events of the sample space with P(A/B) = 0.3, P(B/A) =
0.6 and P(A ∩ B) = 0.3. Find
i. P(A) ii. P(B)
2. From your class of 20 female and 30 male students, the department head
wants to select 5 female and 7 male students for a specific
meeting.
a. In how many ways can the required students be selected without
any restriction?
b. What is the probability that 6 male and 3 female students are included in
the meeting?
3. Five economics, 2 accounting and 3 management books are to be arranged in a
row where books of the same subjects are not distinguishable from each other,
how many different ways of arrangement are possible?
4. There are 12 ways in which manufactured items can be minor defective and 10
ways in which it can be major defective. In how many ways can
i. One minor and one major defective occur?
ii. Two minor and 2 major defective occur?
5. Out of 3 Economists and 7 Accountants, a committee consisting of 2
Economists and 3 Accountants is to be formed. In how many ways can this be done if
i. Any Economists and Accountants can be included?
ii. One particular Accountant must be on the committee?
iii. Two particular Economists cannot be on the committee?
iv. Find the probabilities of i, ii, and iii above.
6. The probability that a man will be alive in 25 years is 0.3 and that his wife will be
alive in 25 years is 0.4.

Find the probability that

i. both will be alive
ii. only the man will be alive
iii. only the woman will be alive
iv. at least one of them will be alive

7. A normal distribution has mean 62.4. Find its standard deviation if 20.05% of
the area under the normal curve lies to the right of 72.9.
8. A random variable has a normal distribution with standard deviation 5. Find
its mean if the probability that the random variable will assume a value less
than 52.5 is 0.6915.


ASSIGNMENT QUESTION (LOAD: 30%)

1. The following present a list of different attributes and rules for assigning
numbers to objects. Try to classify the different measurement systems into one
of the four types of scales.

a. The order in which you were eliminated in a spelling bee as a measure of your
spelling ability.
b. Socioeconomic status of a family when classified as low, middle and upper
classes.

2. What are the major limitations of Statistics? Explain with suitable examples.
3. The accompanying data describe the hourly wage rates (dollars per hour) for
30 employees of an electronics firm:
22.66 24.39 17.31 21.02 21.61 20.97 18.58 16.61
19.74 21.57 20.56 22.16 20.16 18.97 22.64 19.62
22.05 22.03 17.09 24.60 23.82 17.80 16.28 19.34
22.22 19.49 22.27 18.20 19.29 20.43

a. Construct a frequency distribution and draw a histogram, frequency polygon
and cumulative frequency polygon for this data.
b. Calculate the Mean, Median, Mode, Quartiles (Q1, Q2, and Q3), Deciles (D2, D6),
and Percentiles (P50, P70).
c. Calculate the kurtosis and skewness.

4. If a permutation of the letters of the word WHITE is selected at random, how many of the
permutations
i. Begin with a consonant?
ii. End with a vowel?
iii. Have consonants and vowels alternating?


5. If 3 books are picked at random from a shelf containing 5 novels, 3 books of
poems, and a dictionary, what is the probability that

a. The dictionary is selected?
b. 2 novels and 1 book of poems are selected?

6. A proofreader is interested in finding the probability that the number of
mistakes in a page will be less than 10. From his past experience he finds that
out of 3600 pages he has proofed, 200 pages contained no errors, 1200 pages
contained 5 errors, and 2200 pages contained 11 or more errors. Can you help
him in finding the required probability?
7. The LMB Company manufactures tires. They claim that only .007 of LMB
tires are defective. What is the probability of finding 2 defective tires in a
random sample of 50 LMB tires?
8. A gardener knows from his personal experience that 2% of seedlings fail to
survive on transplantation. Find the mean, standard deviation and moment
coefficient of skewness of the distribution of rate of failure to survive in a
sample of 400 seedlings.
9. If P(X = 1) = P(X = 2) for a Poisson random variable X, find
the mean of the distribution.
10. A random variable X has a normal distribution with mean 80 and standard
deviation 4.8. What is the probability that it will take a value

a. Less than 87.2?


b. Greater than 76.4?
c. Between 81.2 and 86.0?

212
Basic Statistics

213

You might also like