Basic Statist final.-2
JIMMA UNIVERSITY
DEPARTMENT OF STATISTICS
Prepared By:
Edited By:
Jimma, Ethiopia
Basic Statistics
Module Introduction
Dear students, first and foremost, you are warmly welcome to the course ‘Basic Statistics’.
This course provides you with a general understanding of the statistical techniques
commonly used in solving business problems and undertaking business research. Topics
include frequency distributions, measures of central tendency, dispersion, and probability
theory.
Quite understandably, as a systematized set of activities and functions, business involves the
application of both qualitative and quantitative techniques. For our purpose, statistical
techniques are gaining prominence in the face of changing technologies and complexities in
business and industry. These techniques are now considered effective tools for
solving business problems, in addition to constituting an important segment of the study of
business in general. They can, however, never be a substitute for human skills, experience, and
judgment. Many activities that were previously handled by verbal analysis and description have
proved to be more easily dealt with using statistical techniques. The use of statistical tools can give
clarity and certainty in handling problems and enforce precision in stating the facts of a situation
where these would otherwise be lost in emotion and argument. The use of statistical
knowledge in the field of business dates back many years. In recent years, the importance of
understanding statistical methods and techniques, and of having the skills to use them, has
been recognized more widely than before. It is essential for anyone making business decisions
on the basis of data to possess a clear understanding of statistics. Among other things, the vast and
fast-changing technological, financial and economic setting has necessitated an organized use
and extensive application of statistical tools in business decision making. Statistics has proved
useful in many ways, such as in establishing relationships, making predictions, and providing
solutions to the many problems of business operations and managerial decisions. Statistics is
widely applied in production and quality control, marketing research, manpower planning,
finance, etc.
The objective of this module is to enable students to apply basic statistical techniques and methods
for grouping, tabular and graphical display, analysis and interpretation of statistical data.
The module also aims to enable students to apply the concept of probability to quantify uncertainty
and assess business risk. Besides, it aims to familiarize students with the various application areas and
benefits of statistics in business. The module is therefore designed to address your need to be
acquainted with the basic concepts and applications of statistics in executing business
activities. Accordingly, this module comprises four chapters with their respective sections
and subsections. The first chapter is an introduction to statistics. The second
chapter covers the visual description of data. The third chapter discusses the
statistical description of data. The fourth chapter covers probability and probability distributions.
To facilitate your study, all sections and subsections in each chapter include
examples with detailed explanations of the steps and procedures used in solving them. In addition,
you will find a chapter summary and review problems at the end of each chapter. The module also provides
you with self-assessment questions and exercises.
Dear students, this chapter is an introduction to statistics. It provides you with a
general understanding of statistics and statistical terms in business, the branches of statistics,
the stages of the statistical investigation process, types of variables, sources of data, measurement
scales and their types, the scope of statistics, and the importance of statistics in business.
For a layman, ‘Statistics’ means numerical information expressed in quantitative terms. This
information may relate to objects, subjects, activities, phenomena, or regions of space. As a
matter of fact, data have no limits as to their reference, coverage, and scope. At the macro
level, there are data on gross national product (GNP) and on the shares of agriculture,
manufacturing, and services in gross domestic product (GDP).
At the micro level, individual firms, however small or large, produce extensive statistics on
their operations. The annual reports of companies contain a variety of data on sales, production,
expenditure, inventories, capital employed, and other activities. These data are often field
data, collected by employing scientific survey techniques. Unless regularly updated, such data
are the product of a one-time effort and have limited use beyond the situation that may have
called for their collection. A student knows statistics more intimately as a subject of study, like
economics, mathematics, chemistry, physics, and others. It is a discipline which scientifically
deals with data, and is often described as the science of data. In dealing with statistics as data,
statistics has developed appropriate methods of collecting, presenting, summarizing, and
analyzing data, and thus consists of a body of these methods.
At the outset, it may be noted that the word ‘statistics’ is used in two
senses: plural and singular. In the plural sense, it refers to a set of figures or data. In the
singular sense, statistics refers to the whole body of tools that are used to collect data,
organize and interpret them and, finally, to draw conclusions from them. It should be noted
that both aspects of statistics are important if quantitative data are to serve their
purpose. If statistics, as a subject, is inadequate and consists of poor methodology, we cannot
know the right procedure to extract from the data the information they contain. Similarly,
if our data are defective, inadequate, or inaccurate, we cannot reach the right
conclusions even though our subject is well developed.
A.L. Bowley defined statistics in three ways: (i) statistics is the science of counting; (ii) statistics
may rightly be called the science of averages; and (iii) statistics is the science of the measurement
of the social organism regarded as a whole in all its manifestations. Boddington defined it as follows:
Statistics is the science of estimates and probabilities. Further, W.I. King defined
statistics in a wider context: the science of statistics is the method of judging collective,
natural or social phenomena from the results obtained by the analysis or enumeration or
collection of estimates.
Seligman stated that statistics is a science that deals with the methods of collecting,
classifying, presenting, comparing and interpreting numerical data collected to throw some
light on any sphere of enquiry. Spiegel defines statistics by highlighting its role in
decision-making, particularly under uncertainty, as follows: statistics is concerned with scientific
methods for collecting, organizing, summarizing, presenting and analyzing data as well as
drawing valid conclusions and making reasonable decisions on the basis of such analysis.
According to Prof. Horace Secrist, statistics are aggregates of facts, affected to a marked
extent by multiplicity of causes, numerically expressed, enumerated or estimated according to
reasonable standards of accuracy, collected in a systematic manner for a pre-determined
purpose, and placed in relation to each other.
From the above definitions, we can highlight the major characteristics of statistics as follows:
(i) Statistics are aggregates of facts. This means that a single figure is not statistics. For example,
the national income of a country for a single year is not statistics, but the same for two or more
years is statistics.
(ii) Statistics are affected by a number of factors. For example, the sale of a product depends on a
number of factors such as its price, quality, competition, the income of consumers, and so
on.
(iii) Statistics must be reasonably accurate. Wrong figures, if analyzed, will lead to erroneous
conclusions. Hence, it is necessary that conclusions be based on accurate figures.
(iv) Statistics must be collected in a systematic manner. If data are collected in a haphazard
manner, they will not be reliable and will lead to misleading conclusions.
(v) Statistics must be collected for a pre-determined purpose.
(vi) Lastly, statistics should be placed in relation to each other. If one collects data unrelated to
each other, then such data will be confusing and will not lead to any logical conclusions. Data
should be comparable over time and over space.
There are two major divisions of statistics, namely descriptive statistics and inferential
statistics. Descriptive statistics deals with collecting, summarizing, and simplifying data,
which are otherwise quite unwieldy and voluminous. It seeks to achieve this in a manner that
allows meaningful conclusions to be readily drawn from the data. Descriptive statistics may thus be
seen as comprising methods of bringing out and highlighting the latent characteristics present
in a set of numerical data. It not only facilitates an understanding of the data and systematic
reporting thereof, but also makes them amenable to further discussion, analysis,
and interpretation.
The first step in any scientific inquiry is to collect data relevant to the problem in hand. When
the inquiry relates to the physical and/or biological sciences, data collection is normally an
integral part of the experiment itself. In fact, the very manner in which an experiment is
designed determines the kind of data it would require and/or generate. The problem of
identifying the nature and the kind of the relevant data is thus automatically resolved as soon
as the design of the experiment is finalized. This is possible in the case of the physical sciences. In the
case of the social sciences, where the required data are often collected through a questionnaire
from a number of carefully selected respondents, the problem is not so simply resolved. For
one thing, designing the questionnaire itself is a critical initial problem. For another, the
number of respondents to be accessed for data collection and the criteria for selecting them
have their own implications and importance for the quality of the results obtained. Further, once the data
have been collected, they are assembled, organized, and presented in the form of appropriate
tables to make them readable. Wherever needed, figures, diagrams, charts, and graphs are also
used for better presentation of the data. A useful tabular and graphic presentation of data
requires that the raw data be properly classified in accordance with the objectives of the
investigation and the relational analysis to be carried out.
A well thought-out and sharp data classification facilitates easy description of the hidden data
characteristics by means of a variety of summary measures. These include measures of central
tendency, dispersion, skewness, and kurtosis, which constitute the essential scope of
descriptive statistics. These form a large part of the subject matter of any basic textbook on
the subject, and thus they are being discussed in that order here as well.
Inferential statistics, also known as inductive statistics, goes beyond describing a given
problem situation by means of collecting, summarizing, and meaningfully presenting the
related data. Instead, it consists of methods that are used for drawing inferences, or making
broad generalizations, about a totality of observations on the basis of knowledge about a part
of that totality. The totality of observations about which an inference may be drawn, or a
generalization made, is called a population or a universe. The part of totality, which is
observed for data collection and analysis to gain knowledge about the population, is called a
sample.
The desired information about a given population of interest may also be collected
by observing all the units comprising the population. This total coverage is called a census.
Getting the desired value for the population through a census is not always feasible and
practical for various reasons. Apart from time and money considerations making census
operations prohibitive, observing each individual unit of the population with reference to any
data characteristic may at times even involve destructive testing. In such cases, obviously, the
only recourse available is to employ the partial or incomplete information gathered through a
sample for the purpose. This is precisely what inferential statistics does. Thus, obtaining a
particular value from the sample information and using it for drawing an inference about the
entire population underlies the subject matter of inferential statistics.
Consider a situation in which one is required to know the average body weight of all the
college students in a given cosmopolitan city during a certain year. A quick and easy way to
do this is to record the weight of only 500 students out of a total of, say, 10,000,
or out of an unknown total, take the average, and use this average based on incomplete
weight data to represent the average body weight of all the college students. In a different
situation, one may have to repeat this exercise for some future year and use the quick estimate
of average body weight for a comparison. This may be needed, for example, to decide
whether the weight of the college students has undergone a significant change over the years
compared.
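The idea of using a sample average to represent a population average can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 10,000 weights is simulated (a hypothetical mean of 60 kg and standard deviation of 8 kg), since no real data are given in this module.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: body weights (kg) of 10,000 college students.
population = [random.gauss(60, 8) for _ in range(10_000)]

# Observe only 500 students chosen at random (the sample).
sample = random.sample(population, 500)

population_mean = statistics.mean(population)  # normally unknown in practice
sample_mean = statistics.mean(sample)          # the quick estimate

print(f"Population mean:  {population_mean:.2f} kg")
print(f"Sample mean:      {sample_mean:.2f} kg")
print(f"Estimation error: {abs(sample_mean - population_mean):.2f} kg")
```

In practice only the sample mean would be available; the population mean is computed here solely to show how close the estimate tends to be.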
Similarly, suppose that a sample of five cells is drawn from a lot and it is found
that all the five cells are in perfectly good condition. This information may be used to
conclude that the entire lot is good enough to buy.
Since this inference is based on the examination of a sample of a limited number of cells, it is
quite possible that not all the cells in the lot are in order. Conversely, it may happen that all the items
included in the sample are unsatisfactory. This may be used to conclude that the
entire lot is of unsatisfactory quality, whereas the fact may indeed be otherwise. It may, thus,
be noticed that there is always a risk of an inference about a population being incorrect when
it is based on the knowledge of a limited sample. The remedy in such situations lies in evaluating
such risks. For this, statistics provides the necessary methods. These center on quantifying, in
probabilistic terms, the chances of decisions taken on the basis of sample information being
incorrect. This requires an understanding of the what, why, and how of probability and
probability distributions to equip ourselves with methods of drawing statistical inferences and
estimating the degree of reliability of these inferences.
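The risk described above can be quantified with a short calculation. The sketch below assumes a hypothetical lot of 100 cells from which 5 are drawn at random without replacement, and computes the chance that all five sampled cells are good even though the lot contains some defective cells. The lot size and the numbers of defective cells are illustrative assumptions, not figures from this module.

```python
from math import comb

def prob_all_sampled_good(lot_size: int, defective: int, sample_size: int) -> float:
    """Probability that a random sample drawn without replacement has no defectives."""
    good = lot_size - defective
    return comb(good, sample_size) / comb(lot_size, sample_size)

# Hypothetical lot of 100 cells; vary the number of defective cells.
for defective in (0, 10, 20, 50):
    p = prob_all_sampled_good(lot_size=100, defective=defective, sample_size=5)
    print(f"{defective:>2} defective cells -> P(all 5 sampled cells good) = {p:.3f}")
```

A probability well above zero for a lot with many defective cells shows why a clean sample alone does not guarantee a good lot.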
1. Collection of data: The first and foremost step to be carried out in statistical analysis is data
collection. This is the step at which data about the problem to be investigated are
collected. Data are very important elements on which the result of the analysis will depend. If
care is not taken during data collection, that is, if wrong data are collected, then the results
obtained from analyzing such data will also be wrong. These wrong results
will lead the analyst to wrong conclusions, which in turn will mislead decision makers.
Data can be collected in a variety of ways; one of the most common is
the survey. Surveys can be carried out by different methods, three of the most
common of which are:
Mailed (postal) questionnaire
Personal interview
Telephone interview
A. Mailed questionnaire
Advantages:
Can cover a large number of people or organizations
No prior arrangements are needed
No interviewer bias
Disadvantages:
Little opportunity to use visual aids
Low response rate
Cannot reach all types of people
Not possible to give assistance if required
B. Personal interview
Advantages:
Disadvantages:
training is required
C. Telephone interview
Advantages:
Quick
Can cover reasonably large numbers of people or organizations
Wide geographic coverage
High response rate: calls can continue until the required number is reached
No waiting
Spontaneous response
Help can be given to the respondent
Can tape-record answers
Disadvantages:
Observation
Advantages
Disadvantages
Designing a Questionnaire
A questionnaire is designed with two main objectives:
To maximize the proportion of subjects answering our questionnaire, that is, the response rate.
To obtain accurate, relevant information for our survey.
Types of questions
Closed-ended questions are questions that can only be answered by selecting from a limited
number of options, usually multiple-choice, 'yes' or 'no', or a rating scale (e.g. from strongly
agree to strongly disagree).
Closed-ended questions give limited insight, but can easily be analyzed for quantitative data.
Example
Open questions
Exercise: Discuss the advantages and disadvantages of the above three methods with respect to
each other.
2. Organization of data: Once data have been collected, the next step is to arrange the bulk
of data in a simple and understandable manner. This includes classifying the data according to
their resemblance, and arranging data in tabular form by putting records (data) in rows and
columns.
3. Presentation of data: After data have been collected and organized, the following step is
presentation. That means we can transform the data into charts or graphs, or we can present them
using frequency distributions.
4. Analysis of data: This is the step where the presented data are investigated using
different statistical techniques. Among the different methods of analysing data, we
can mention some simple descriptive analyses, such as those dealing with measures of central
tendency, measures of variation and so on.
5. Inference of data: The final step in conducting a statistical investigation is
interpreting the results obtained in the preceding steps. This includes giving appropriate
conclusions based on the values obtained from the data. That means we need to transform the
numerical results computed from the gathered data into statements regarding the problem
under investigation, so that decision makers can easily understand them and make decisions based
on the conclusions drawn.
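The organization, presentation, and analysis steps above can be sketched on a small data set. The raw data below (number of items bought by 12 customers) are hypothetical and serve only to illustrate the steps.

```python
from collections import Counter
from statistics import mean

# Hypothetical raw data: number of items bought by 12 customers.
raw_data = [2, 3, 1, 2, 4, 2, 3, 1, 2, 5, 3, 2]

# Organization: tally each distinct value into a frequency table.
frequency = Counter(raw_data)

# Presentation: print the table together with a simple text bar chart.
print("Items  Frequency")
for value in sorted(frequency):
    print(f"{value:>5}  {frequency[value]:>9}  {'*' * frequency[value]}")

# Analysis: summarize with a measure of central tendency and of variation.
average = mean(raw_data)
value_range = max(raw_data) - min(raw_data)
print(f"Mean = {average:.2f}, range = {value_range}")
```

The final step, interpretation, would then turn these numbers into a statement such as "a typical customer buys between two and three items".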
Statistical data are the basic raw material of statistics. Data may relate to an activity of
interest, a phenomenon, or a problem situation under study. They arise as a result of the
process of measuring, counting and/or observing. Statistical data, therefore, refer to those
aspects of a problem situation that can be measured, quantified, counted, or classified. Any
object, subject, phenomenon, or activity that generates data through this process is termed a
variable.
In other words, a variable is one that shows a degree of variability when successive
measurements are recorded. In statistics, variables are classified into two broad categories:
quantitative variables and qualitative variables. This classification is based on the kind of
characteristics that are measured.
i) Quantitative variables are those that can be quantified in definite units of measurement. These
refer to characteristics whose successive measurements yield quantifiable observations.
Depending on the nature of the variable observed for measurement, quantitative variables can
be further categorized as continuous and discrete variables.
b) Rank variables, on the other hand, are the result of assigning ranks to specify order in terms
of the integers 1, 2, 3, ..., n. Ranks may be assigned according to the level of performance in a
test, a contest, a competition, an interview, or a show. The candidates appearing in an
interview, for example, may be assigned ranks in integers ranging from 1 to n, depending on
their performance in the interview. Ranks so assigned can be viewed as the discrete values
of a variable involving performance as the quality characteristic.
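Assigning ranks can be sketched as follows. The candidate names and interview scores are hypothetical, and the sketch assumes there are no tied scores.

```python
# Hypothetical interview scores for five candidates (no ties assumed).
scores = {"Abebe": 78, "Chaltu": 91, "Dawit": 65, "Hana": 84, "Kebede": 72}

# Sort candidates from best to worst score and assign ranks 1, 2, ..., n.
ordered = sorted(scores, key=scores.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

for name in ordered:
    print(f"Rank {ranks[name]}: {name} (score {scores[name]})")
```

The ranks keep only the order of the candidates; the size of the gap between two scores is not preserved.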
Data sources are of two types, viz., secondary and primary.
Depending on the source, data can be classified as:
(i) Secondary data: These already exist in some form, published or unpublished, in an identifiable
secondary source. They are generally available from published source(s), though not
necessarily in the form actually required.
(ii) Primary data: These are data which do not already exist in any form, and thus have to be
collected for the first time from the primary source(s). By their very nature, these data require
fresh and first-time collection covering the whole population or a sample drawn from it.
Proper knowledge about the nature and type of data to be dealt with is essential in order to
specify and apply the proper statistical method for their analysis and inferences. Measurement
scale refers to the property of value assigned to the data based on the properties of order,
distance and fixed zero.
In mathematical terms measurement is a functional mapping from the set of objects {Oi} to
the set of real numbers {M(Oi)}.
The goal of measurement systems is to structure the rule for assigning numbers to objects in
such a way that the relationship between the objects is preserved in the numbers assigned to
the objects. The different kinds of relationships preserved are called properties of the
measurement system.
Order
The property of order exists when an object that has more of the attribute than another object
is given a bigger number by the rule system. This relationship must hold for all objects in the
"real world".
Distance
The property of distance is concerned with the relationship of differences between objects. If
a measurement system possesses the property of distance, it means that the unit of
measurement means the same thing throughout the scale of numbers. That is, an inch is an
inch, no matter where it falls, immediately ahead or a mile down the road.
More precisely, an equal difference between two numbers reflects an equal difference in the
"real world" between the objects that were assigned the numbers. In order to define the
property of distance in mathematical notation, four objects are required: Oi, Oj, Ok, and Ol.
The difference between objects is represented by the "-" sign; Oi - Oj refers to the actual
"real world" difference between object i and object j, while M(Oi) - M(Oj) refers to
the difference between the numbers assigned to them. The property of distance then requires that
whenever M(Oi) - M(Oj) = M(Ok) - M(Ol), the "real world" differences Oi - Oj and Ok - Ol are also equal.
Fixed Zero
A measurement system possesses a rational zero (fixed zero) if an object that has none of the
attribute in question is assigned the number zero by the system of rules. The object does not
need to really exist in the "real world", as it is somewhat difficult to visualize a "man with no
height". The requirement for a rational zero is this: if objects with none of the attribute did
exist, would they be given the value zero? Defining O0 as the object with none of the attribute
in question, the definition of a rational zero becomes: M(O0) = 0.
The property of fixed zero is necessary for ratios between numbers to be meaningful.
Nominal Scales
Nominal scales are measurement systems that possess none of the three properties stated
above.
Level of measurement which classifies data into mutually exclusive, all-inclusive categories
in which no order or ranking can be imposed on the data.
Ordinal Scales
Ordinal scales are measurement systems that possess the property of order, but not the
property of distance. The property of fixed zero is not important if the property of distance is
not satisfied.
This level of measurement classifies data into categories that can be ranked. However, precise
differences between the ranks do not exist.
Arithmetic operations are not applicable but relational operations are applicable.
Examples:
Letter grades (A, B, C, D, F).
Rating scales (Excellent, very good, Good, Fair, poor).
Military status.
Interval Scales
Interval scales are measurement systems that possess the properties of order and distance, but
not the property of fixed zero. This level of measurement classifies data that can be ranked
and for which differences are meaningful. However, there is no meaningful zero, so ratios are
meaningless.
Ratio Scales
Ratio scales are measurement systems that possess all three properties: order, distance, and
fixed zero. The added power of a fixed zero allows ratios of numbers to be meaningfully
interpreted; e.g., the ratio of Bekele's height to Martha's height is 1.32, whereas this is not
possible with interval scales.
This level of measurement classifies data that can be ranked, for which differences are meaningful,
and for which there is a true zero. True ratios exist between the different units of measure.
Examples:
Weight
Height
Number of students
Age
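The difference between interval and ratio scales can be checked numerically. The sketch below uses temperature in degrees Celsius as an example of an interval scale (an example chosen here, not taken from the lists above) and height as a ratio scale; all values are hypothetical.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

def cm_to_inches(cm: float) -> float:
    """Convert a length from centimetres to inches."""
    return cm / 2.54

# Interval scale: the ratio of two temperatures depends on the unit used.
t1_c, t2_c = 20.0, 10.0
ratio_c = t1_c / t2_c
ratio_f = celsius_to_fahrenheit(t1_c) / celsius_to_fahrenheit(t2_c)
print(f"Temperature ratio in Celsius: {ratio_c:.2f}, in Fahrenheit: {ratio_f:.2f}")

# Ratio scale: the ratio of two heights is the same in any unit.
h1_cm, h2_cm = 180.0, 150.0
ratio_cm = h1_cm / h2_cm
ratio_in = cm_to_inches(h1_cm) / cm_to_inches(h2_cm)
print(f"Height ratio in cm: {ratio_cm:.2f}, in inches: {ratio_in:.2f}")
```

Because the Celsius and Fahrenheit zeros are arbitrary, the statement "twice as hot" changes with the unit, while "1.2 times as tall" does not.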
EXERCISE
The following presents a list of different attributes and rules for assigning numbers to objects.
Try to classify the different measurement systems into one of the four types of scales.
There are three major functions in any business enterprise in which the statistical methods are
useful. These are as follows:
(i) The planning of operations: This may relate either to special projects or to the recurring activities
of a firm over a specified period.
(ii) The setting up of standards: This may relate to the size of employment, volume of sales,
fixation of quality norms for the manufactured product, norms for the daily output, and so
forth.
(iii) The function of control: This involves comparison of actual production achieved against the
norm or target set earlier. In case the production has fallen short of the target, remedial
measures are taken so that such a deficiency does not occur again.
It is worth noting that although these three functions (planning of operations, setting
standards, and control) are separate, in practice they are very much interrelated.
Different authors have highlighted the importance of statistics in business. For instance,
Croxton and Cowden give numerous uses of statistics in business, such as project planning,
budgetary planning and control, inventory planning and control, quality control, marketing,
production and personnel administration. Within these, they have also specified certain areas
where statistics is very relevant. Another author, Irving W. Burr, dealing with the place of
statistics in an industrial organization, specifies a number of areas where statistics is
extremely useful. These are: customer wants and market research, development design and
specification, purchasing, production, inspection, packaging and shipping, sales and
complaints, inventory and maintenance, costs, management control, industrial engineering
and research.
Statistical problems arising in the course of business operations are multitudinous. As such,
one can do no more than highlight some of the more important ones to emphasize the
relevance of statistics to the business world. In the sphere of production, for example,
statistics can be useful in various ways.
Statistical quality control methods are used to ensure the production of quality goods.
This is achieved by identifying and rejecting defective or substandard goods. Sales targets can be
fixed on the basis of sales forecasts, which are made using various methods of forecasting.
Analysis of the sales effected against the targets set earlier would indicate any deficiency in
achievement, which may be on account of several causes:
Another sphere in business where statistical methods can be used is personnel management.
Here, one is concerned with the fixation of wage rates, incentive norms and the performance
appraisal of individual employees. The concept of productivity is very relevant here. On the
basis of the measurement of productivity, a productivity bonus is awarded to the workers.
Comparisons of wages and productivity are undertaken in order to ensure increases in
industrial productivity.
Statistical methods can also be used to ascertain the efficacy of a certain product, say, a
medicine. For example, suppose a pharmaceutical company has developed a new medicine for the
treatment of bronchial asthma. Before launching it on a commercial basis, it wants to ascertain
the effectiveness of this medicine. It undertakes an experiment involving the formation
of two comparable groups of asthma patients. One group is given the new medicine for a
specified period and the other is treated with the usual medicines. Records are maintained
for the two groups for the specified period. These records are then analysed to ascertain whether there is
any significant difference in the recovery of the two groups. If the difference is
statistically significant, the new medicine is commercially launched.
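One way to check whether the difference in recovery between the two groups is statistically significant is a two-proportion z-test. The module does not name a specific test at this point, so the choice of test, the group sizes, and the recovery counts below are all assumptions made for illustration.

```python
from math import erf, sqrt

def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both groups have the same recovery rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_error
    # Standard normal CDF evaluated at |z|, via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical trial: 100 patients per group; 70 recover on the new
# medicine and 55 recover on the usual medicines.
z, p_value = two_proportion_z_test(success_a=70, n_a=100, success_b=55, n_b=100)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("Significant at the 5% level" if p_value < 0.05 else "Not significant at the 5% level")
```

A small p-value means that a difference this large would rarely arise by chance alone if the two treatments were equally effective.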
Apart from the methods comprising the scope of descriptive and inferential branches of
statistics, statistics also consists of methods of dealing with a few other issues of specific
nature. Since these methods are essentially descriptive in nature, they have been discussed
here as part of the descriptive statistics. These are mainly concerned with the following:
(i) It often becomes necessary to examine how two paired data sets are related. For example, we may
have data on the sales of a product and the expenditure incurred on its advertisement for a
specified number of years. Given that sales and advertisement expenditure are related to each
other, it is useful to examine the nature of the relationship between the two and quantify the
degree of that relationship. As this requires the use of appropriate statistical methods, these fall
under the purview of what we call regression and correlation analysis.
(ii) Situations occur quite often when we require averaging (or totalling) of data on prices and/or
quantities expressed in different units of measurement. For example, price of cloth may be
quoted per meter of length and that of wheat per kilogram of weight. Since ordinary methods
of totalling and averaging do not apply to such price/quantity data, special techniques needed
for the purpose are developed under index numbers.
(iii) Many a time, it becomes necessary to examine the past performance of an activity with a view
to determining its future behaviour. For example, when engaged in the production of a
commodity, monthly product sales are an important measure of evaluating performance. This
requires compilation and analysis of relevant sales data over time. The more complex the
activity, the more varied the data requirements. For profit maximizing and future sales
planning, forecast of likely sales growth rate is crucial. This needs careful collection and
analysis of past sales data. All such concerns are taken care of under time series analysis.
(iv) Obtaining the most likely future estimates on any aspect(s) relating to a business or economic
activity has always engaged the minds of all concerned. This is particularly important
when it relates to product sales and demand, which serve as the necessary basis of production
scheduling and planning. The regression, correlation, and time series analyses together help
develop the basic methodology to do what is needed. Thus, the study of methods and techniques
of obtaining the likely estimates on business/economic variables comprises the scope of what
we do under business forecasting.
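The relationship between advertising expenditure and sales mentioned in (i), and its use for forecasting as in (iv), can be sketched with a correlation coefficient and a least-squares line. The six pairs of figures below are hypothetical and are given in arbitrary currency units.

```python
from math import sqrt

# Hypothetical yearly data (arbitrary currency units).
advertising = [10, 12, 15, 18, 20, 25]
sales = [110, 118, 135, 150, 158, 185]

n = len(advertising)
mean_x = sum(advertising) / n
mean_y = sum(sales) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in advertising)
syy = sum((y - mean_y) ** 2 for y in sales)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(advertising, sales))

r = sxy / sqrt(sxx * syy)            # Pearson correlation coefficient
slope = sxy / sxx                    # least-squares regression slope
intercept = mean_y - slope * mean_x  # least-squares regression intercept

print(f"Correlation r = {r:.3f}")
print(f"Fitted line: sales = {intercept:.2f} + {slope:.2f} * advertising")

# Forecast sales for a planned advertising expenditure of 30 units.
forecast = intercept + slope * 30
print(f"Forecast sales: {forecast:.1f}")
```

A value of r close to 1 indicates a strong positive linear relationship. The forecast extends the fitted line beyond the observed data, so it should be read with caution.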
Keeping in view the importance of inferential statistics, the scope of statistics may finally be
restated as consisting of statistical methods which facilitate decision-making under
conditions of uncertainty. While the term statistical methods is often used to cover the subject
of statistics as a whole, in particular it refers to methods by which statistical data are analysed
and interpreted, and inferences are drawn for decision-making.
Though generic in nature and versatile in their applications, statistical methods have come to
be widely used, especially in all matters concerning business and economics. These are also
being increasingly used in biology, medicine, agriculture, psychology, and education. The
scope of application of these methods has started opening and expanding in a number of
social science disciplines as well. Even a political scientist finds them of increasing relevance
for examining political behavior, and it is, of course, no surprise to find even historians
using statistical data, for history is essentially past data.
(i) There are certain phenomena or concepts where statistics cannot be used. This is because these
phenomena or concepts are not amenable to measurement. For example, beauty, intelligence,
courage cannot be quantified. Statistics has no place in all such cases where quantification is
not possible.
(ii) Statistics reveal the average behaviour, the normal or the general trend. Applying the
'average' concept to an individual or a particular situation may lead to a wrong
conclusion and may sometimes be disastrous. For example, one may be misguided when told
that the average depth of a river from one bank to the other is four feet, when there may be
some points in between where its depth is far more than four feet. On this understanding, one
may enter those points having greater depth, which may be hazardous.
(iii) Since statistics are collected for a particular purpose, such data may not be relevant or useful
in other situations or cases. For example, secondary data (i.e., data originally collected by
someone else) may not be useful for the other person.
(iv) Statistics are not 100 per cent precise as is Mathematics or Accountancy. Those who use
statistics should be aware of this limitation.
(v) In statistical surveys, sampling is generally used as it is not physically possible to cover all the
units or elements comprising the universe. The results may not be appropriate as far as the
universe is concerned. Moreover, different surveys based on the same size of sample but
different sample units may yield different results.
(vi) At times, association or relationship between two or more variables is studied in statistics, but
such a relationship does not indicate a 'cause and effect' relationship. It simply shows the
similarity or dissimilarity in the movement of the two variables. In such cases, it is the user
who has to interpret the results carefully, pointing out the type of relationship obtained.
(vii) A major limitation of statistics is that it does not reveal everything pertaining to a certain
phenomenon. There is some background information that statistics does not cover. Similarly,
there are some other aspects related to the problem on hand, which are also not covered. The
user of Statistics has to be well informed and should interpret Statistics keeping in mind all
other aspects having relevance on the given problem.
Apart from the limitations of statistics mentioned above, there are misuses of it. Many people,
knowingly or unknowingly, use statistical data in the wrong manner. Let us see what the main
misuses of statistics are so that the same could be avoided when one has to use statistical data.
The misuse of Statistics may take several forms some of which are explained below.
(i) Sources of data not given: At times, the source of data is not given. In the absence of the source,
the reader does not know how far the data are reliable. Further, if he wants to refer to the
original source, he is unable to do so.
(ii) Defective data: Another misuse is that sometimes one gives defective data. This may be done
knowingly in order to defend one's position or to prove a particular point. This apart, the
definition used to denote a certain phenomenon may be defective. For example, in case of
data relating to unemployed persons, the definition may include even those who are
employed, though partially. The question here is how far it is justified to include partially
employed persons amongst unemployed ones.
(iii) Unrepresentative sample: In statistics, one often has to conduct a survey, which
necessitates choosing a sample from the given population or universe. The sample may turn
out to be unrepresentative of the universe. One may choose a sample just on the basis of
convenience. He may collect the desired information from either his friends or nearby
respondents in his neighborhood even though such respondents do not constitute a
representative sample.
(iv) Inadequate sample: Earlier, we have seen that a sample that is unrepresentative of the universe
is a major misuse of statistics. This apart, at times one may conduct a survey based on an
extremely inadequate sample. For example, in a city we may find that there are 100,000
households. When we have to conduct a household survey, we may take a sample of merely
100 households comprising only 0.1 per cent of the universe. A survey based on such a small
sample may not yield right information.
(v) Unfair Comparisons: An important misuse of statistics is making unfair comparisons from the data
collected. For instance, one may construct an index of production choosing the base year
where the production was much less. Then he may compare the subsequent year's production
from this low base.
Such a comparison will undoubtedly give a rosy picture of the production though in reality it
is not so. Another source of unfair comparisons could be when one makes absolute
comparisons instead of relative ones. An absolute comparison of two figures, say, of
production or export, may show a good increase, but in relative terms it may turn out to be
very negligible. Another example of unfair comparison is when the population in two cities is
different, but a comparison of overall death rates and deaths by a particular disease is
attempted. Such a comparison is wrong. Likewise, when data are not properly classified or
when changes in the composition of population in the two years are not taken into
consideration, comparisons of such data would be unfair as they would lead to misleading
conclusions.
There are three major functions in any business enterprise in which the statistical methods are
useful. These are as follows:
(i) The planning of operations: This may relate to either special projects or to the recurring activities
of a firm over a specified period.
(ii) The setting up of standards: This may relate to the size of employment, volume of sales,
fixation of quality norms for the manufactured product, norms for the daily output, and so
forth.
(iii) The function of control: This involves comparison of actual production achieved against the
norm or target set earlier. In case the production has fallen short of the target, it suggests
remedial measures so that such a deficiency does not occur again.
A point worth noting is that although these three functions (planning of operations, setting
standards, and control) are separate, in practice they are very much interrelated. Different
authors have highlighted the importance of Statistics in business. For instance, Croxton and
Cowden give numerous uses of Statistics in business such as project planning, budgetary
planning and control, inventory planning and control, quality control, marketing, production
and personnel administration. Within these also they have specified certain areas where
Statistics is very relevant. Another author, Irving W. Burr, dealing with the place of statistics
in an industrial organization, specifies a number of areas where statistics is extremely useful.
These are: customer wants and market research, development design and specification,
purchasing, production, inspection, packaging and shipping, sales and complaints, inventory
and maintenance, costs, management control, industrial engineering and research.
Statistical problems arising in the course of business operations are multitudinous. As such,
one may do no more than highlight some of the more important ones to emphasize the
relevance of statistics to the business world. In the sphere of production, for example,
statistics can be useful in various ways. Statistical quality control methods are used to ensure
the production of quality goods. This is achieved by identifying and rejecting defective or
substandard goods. Sales targets can be fixed on the basis of sales forecasts, which are made
using various methods of forecasting. Analysis of sales effected against the targets set earlier
would indicate the deficiency in achievement, which may be on account of several causes: (i)
targets were too high and unrealistic, (ii) salesmen's performance has been poor, (iii)
increased competition has emerged, (iv) the company's product is of poor quality, and so on.
These factors can be further investigated.
Another sphere in business where statistical methods can be used is personnel management.
Here, one is concerned with the fixation of wage rates, incentive norms and performance
appraisal of individual employees. The concept of productivity is very relevant here. On the
basis of measurement of productivity, the productivity bonus is awarded to the workers.
Comparisons of wages and productivity are undertaken in order to ensure increases in
industrial productivity. Statistical methods could also be used to ascertain the efficacy of a
certain product, say, a medicine. For example, a pharmaceutical company has developed a new
medicine for the treatment of bronchial asthma. Before launching it on a commercial basis, it
wants to ascertain the effectiveness of this medicine. It undertakes an experiment
involving the formation of two comparable groups of asthma patients. One group is given this
new medicine for a specified period and the other one is treated with the usual medicines.
Records are maintained for the two groups for the specified period. This record is then
analyzed to ascertain if there is any significant difference in the recovery of the two groups. If
the difference is really significant statistically, the new medicine is commercially launched.
1.5 Summary
1. Define Statistics. Explain its types, and importance to trade, commerce and business.
Introduction
The first step in any statistical investigation is collecting relevant data. After the data have been
collected, they have to be organized and presented in a systematic manner. Common methods of
presenting numerical data are frequency distributions, diagrammatic and graphical methods.
Frequency distributions are of different types. These are: ungrouped or discrete, qualitative,
grouped or continuous, relative, and cumulative frequency distributions.
Based on the frequency distributions mentioned above, we can present our data using different
diagrams, graphs, stem and leaf plots, and dot plots. The diagrams that will be discussed
are bar charts (simple, component, percentage component, and multiple) and pie charts. We have
different types of graphs for presentation of data, such as the histogram, frequency polygon,
cumulative frequency curves (ogive curves), line graph and vertical line graph.
Learning Outcomes
At the end of this chapter students will be able to:
Construct relative and cumulative frequency distribution for raw data
Identify class marks, class width and class boundaries
Present numerical data using graphs and charts
Distinguish between categorical and continuous frequency distributions
Follow the principles for constructing frequency distributions
Distinguish between the 'less than' and 'more than' cumulative frequency distributions
Distinguish between the different types of diagrams
Distinguish between the different types of graphs
Construct different graphs, diagrams, stem and leaf and dot plots for a given data set
Basic Concepts
In any statistical investigation the first step is to collect a set of related observations (data)
from which conclusions may be drawn.
Data: a set of related information (facts) from which statistical conclusions may be drawn;
equivalently, data are the observed values of a variable.
Variable: a characteristic that can assume different values. Based on the information desired,
variables can be classified as qualitative or quantitative.
Qualitative Data: are data which are non-numeric in nature and cannot be measured or
described numerically.
Examples: Soil type of a fruit farm, sex of a patient, eye color of an ostrich, religion,
type of sport (football, athletics, ...), etc.
Quantitative Data: are data that can be expressed numerically or are data that are numeric in
nature. Quantitative data can be further classified as discrete or continuous.
a. Discrete Data: data that assume a finite or countable number of possible values. Discrete
data are usually obtained by counting.
Examples: Number of customers, number of tourists in Jimma, number of children in a family,
etc.
b. Continuous Data: data that can theoretically assume an infinite number of possible values.
Continuous data are obtained by measuring.
Examples: Amount of money in a certain account, yield of wheat from a certain farm, area of
crop land in m², etc.
The nature of the data we obtain depends on the nature of the study and on the characteristics
of interest in the population. For this reason, we have different types of data under different
bases of classification.
A. Classifications by Sources
The statistical data may be classified under two categories, depending upon the sources. These
are primary and secondary data.
Primary Data: are those data, which are collected by the investigator himself for the purpose
of a specific inquiry or study. Such data are original in character and are mostly generated by
Primary method of data collection consists of obtaining data or information by any of the
following methods
Direct personal Interview: is a conversation between two people that is initiated by the
interviewer (researcher) in order to obtain the required information. The interviewer (usually
the investigator) sets series of questions directly related to his work in advance and conducts
the interview. The interviewer may bring a tape recorder and other necessary materials.
Mailed Questionnaires: questionnaires are sent by post to the informants together with a
polite covering letter, and the informants return them with their answers to the researcher.
Self-administered questionnaires: a method of data collection in which the researcher gives
a well-organized questionnaire directly to the respondents.
Secondary Data: When an investigator uses data, which have already been collected by
others, such data are called "Secondary Data". Data are primary data for the agency that
collected them, and become secondary for someone else who uses these data for his own
purposes. Secondary data can be obtained from journals, reports, government and non-
government publications, publications of professionals and research organizations (in general
Published and unpublished sources).
Secondary data are less expensive to collect, in both money and time. In most cases,
however, secondary data must be used with utmost care because:
According to the role of time, data are classified into cross-section and time series data.
Time series data are a set of observations collected for a sequence of times, usually at equal
intervals, which may be weekly, monthly, quarterly, yearly, etc.
Source of Data
Essentially, we have two categories of data, namely primary data and secondary data.
Primary data: are data which are collected directly from the units or individual respondents
for the purpose of a certain study.
These data are original in character and are collected to meet the specific needs of the problem at hand.
They are collected for the first time by the immediate users of the data.
Example:
Secondary data: are data which are taken from the records of institutions that collect and
publish statistics as part of their routine duties. These data already exist: they have
previously been collected and reported by some individual or organization for its own purpose,
and at a later stage some of the data are made available to other individuals or
organizations.
Example:
Data collection is the first task to be carried out in statistical analysis. There are two methods
of data collection, namely primary and secondary methods of data collection.
The primary method consists of obtaining data or information by any of the following ways:
Advantage: Gives relatively more accurate data on behavior and activities. The method is
independent of respondent’s willingness to respond.
Disadvantages: The results may be affected by the investigator's or observer's own bias, prejudice,
and desires, and the method needs more resources and skilled human power when high-level machines
are used. Information provided by this method is very limited because unforeseen factors may
interfere with the observation.
b) Personal interview
This involves the presentation of oral-verbal stimuli and replies in terms of oral-verbal responses.
These are some commonly used data collection techniques. Therefore, designing a good tool
which will serve for collecting information is a vital task that requires due attention while
developing research proposals.
Studies with many respondents often use shorter, highly structured questionnaires, whereas
smaller studies allow more flexibility and may use questionnaires with a number of open-
ended questions.
Once the decision has been made, interviews may be more or less structured. An unstructured
interview is flexible: the content, wording and order of the questions vary from interview to
interview.
Standardized methods (where the wording and order of the questions are decided in advance)
of asking questions are usually preferred in community research, since they provide more
assurance that the data will be reproducible.
There is also another method of data collection where selected correspondents take part. This
is a type of data collection in which those correspondents are used to collect data according to
the guidelines given by the survey conducting institutions.
Under this method, the investigator prepares a questionnaire containing a number of questions
pertaining to the field of inquiry. The questionnaire is sent by mail to the informants together
with a covering letter requesting the respondents to cooperate by giving correct responses
and returning the filled-in questionnaire. The letter also explains such details as the objectives of
the data collection, how the questionnaire should be filled in, how
important the response of every selected respondent is, and so on.
Types of questionnaires
Open-ended question: permits a free response, which should be recorded in the
respondent's own words. This type of question is useful for obtaining information on
sensitive issues, opinions and facts not very familiar to the researcher.
Closed-ended question: offers a list of options from which the respondents choose. In its
simplest form, the question has only two possible answers (yes/no or true/false).
Multiple choice question: the respondent is restricted to the choices given and selects one of the
alternative possible answers.
In designing a questionnaire
Merits
Demerits
The data obtained through these methods are called primary data, as explained above.
The other method of data collection, secondary method, is a method by which we collect data
from secondary sources such as administrative records, books, survey results and so on. Data
collected through such methods are known as secondary data, as has been discussed above.
In the previous chapter you were introduced to the definition, applications, uses and
limitations of statistics, as well as the methods of data collection and sampling from a population.
Once the data have been collected, we will have a mass of raw data. Our
mind cannot readily grasp the overall content of such a mass of data. Hence, we have to
condense or summarize them in the form of tables and graphs. This chapter deals with the
presentation of numerical data using tables, diagrams and graphs.
Frequency distribution
Sorting data into categories or classes leads to the formation of frequency distributions.
The term frequency denotes the number of times a category or class occurs, and a
frequency distribution lists this number for every category or class. The frequency is
denoted by f and the class is denoted by X.
Statistical tables
Data can also be provided in statistical tables where the major advantages of such tables are:
i. Tabulated data can be more easily understood than facts given in the form of description
ii. They facilitate quick comparisons
iii. Statistical tables make the summation of items and detection of errors and omissions easier
iv. When data are tabulated, all unnecessary details and repetitions are avoided.
Definition:
Raw data: recorded information in its original collected form, whether it is counts or
measurements, is referred to as raw data.
Frequency distribution: is the organization of raw data in table form using classes and
frequencies.
It shows a distribution in which the values of a variable are linked with their respective frequencies,
and it is mostly used for small data sets. A discrete frequency distribution is one which involves a
discrete variable.
4. Prepare three columns, the first for the different values of the variable, the second for tally
marks to facilitate the counting, the third for the frequency corresponding to each value of the
variable
5. Write the possible values of the variable in ascending order in the first column.
Example 2.1: The following data represents the number of books read in the past six months
by each student in a class of 25.
6 24 14 11 33
15 15 8 14 10
8 27 15 6 20
20 9 33 15 10
6 11 20 8 6
Array of number of books read in the past six months by each student in a class of 25.
6 6 6 6 8 8 8 9 10 10 11 11
14 14 15 15 15 15 20 20 20 24 27 33
33
Since the variable “Number of books read” can assume only the values 0,1,2,3,4,5,6…,
(which are whole numbers) it is a discrete variable.
Number of books (X)    Frequency (f)
6                      4
8                      3
9                      1
10                     2
11                     2
14                     2
15                     4
20                     3
24                     1
27                     1
33                     2
Total                  25
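The tallying steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the data are the 25 values of Example 2.1.

```python
from collections import Counter

# Number of books read by each of the 25 students (data of Example 2.1).
books = [6, 24, 14, 11, 33,
         15, 15, 8, 14, 10,
         8, 27, 15, 6, 20,
         20, 9, 33, 15, 10,
         6, 11, 20, 8, 6]

# Counter maps each distinct value X to the number of times it occurs (f).
frequency = Counter(books)

print("X    f")
for value in sorted(frequency):      # values in ascending order
    print(f"{value:<4} {frequency[value]}")
print("Total", sum(frequency.values()))
```

Counting with a dictionary-like structure plays the role of the tally column: each occurrence of a value adds one to its frequency.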
A categorical frequency distribution is used for data that can be placed into specific categories,
such as nominal- or ordinal-level data (e.g., marital status).
Step 2: Tally the data and place the result in column (2).
Step 3: Count the tally and place the result in column (3).
$\% = \frac{f}{n} \times 100$, where f = frequency of the class and n = total number of values.
Percentages are not normally a part of a frequency distribution, but they can be added since they
are used in certain types of diagrams, such as pie charts.
Example 2.2: A social worker collected the following data on marital status for 25 persons.
M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
Class (1)   Tally (2)   Frequency (3)   Percent (4)
M           ||||||      6               24
S           |||||||     7               28
D           |||||||     7               28
W           |||||       5               20
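As a check on the tally, the counts and percentages for the 25 responses listed in Example 2.2 can be computed directly. The following is a minimal sketch with variable names of our own choosing; it applies the formula % = (f/n) × 100 to each category.

```python
from collections import Counter

# The 25 marital-status responses of Example 2.2, row by row.
responses = ("M S D W D "
             "S S M M M "
             "W D S M M "
             "W D D S S "
             "S W W D D").split()

n = len(responses)                    # total number of values
counts = Counter(responses)           # frequency f of each category
percent = {status: 100 * f / n for status, f in counts.items()}

for status in ["M", "S", "D", "W"]:
    print(status, counts[status], percent[status])
```

Counting the listed responses this way gives 6 for M, 7 for S, 7 for D and 5 for W, which add up to 25.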
Example 2.3: The following table shows the frequency distribution of the test results of 50
students in Statistics course.
Test result    Number of students
10 – 13        8
14 – 17        15
18 – 21        15
22 – 25        7
26 – 30        5
Total          50
The categories into which the observations are distributed are called classes or class intervals.
The classes should be set so that they contain all items and no two classes share the same
item. This is the basic principle in the construction of such frequency distributions. We will
define some concepts associated with continuous frequency distributions in the following
way.
Class limits: In the above table the students are distributed into different classes. There are 8
students with scores between 10 and 13. The numbers 10 and 13 are called the lower- and
upper-class limits, respectively. There are 15 students with scores between 14 and 17. The numbers
14 and 17 are called the lower- and upper-class limits, respectively.
Class limits are therefore the lowest and highest values that can be included in a class. In the
above example, the numbers 10, 14, 18 and 22 are lower-class limits (LCL) and
the numbers 13, 17, 21 and 25 are upper-class limits (UCL).
Class boundaries (real class limits): A class boundary is a number that does not appear in
the stated class limits but is rather a value that falls midway between the upper limit of one
class and the lower limit of the next higher one.
In practice, the class boundaries are obtained by adding the upper-class limit of one class
interval to the lower-class limit of the next higher class interval and dividing the sum by 2.
Equivalently, let d be the difference between the lower limit of a class and the upper limit of
the preceding class. Adding ½ d to the upper limits gives the upper-class boundaries (UCB), and
subtracting ½ d from the lower limits gives the lower-class boundaries (LCB).
For the table above, d = 14 – 13 = 1, so ½ d = 0.5.
Subtracting 0.5 from the lower limits gives 9.5, 13.5, 17.5 and 21.5, which are the lower-class boundaries.
Adding 0.5 to the upper limits gives 13.5, 17.5, 21.5 and 25.5, which are the upper-class
boundaries.
Class limits   Frequency   Class boundaries
10-13          8           9.5-13.5
18-21          20          17.5-21.5
22-25          20          21.5-25.5
Total          50
Alternatively, class boundaries are obtained by subtracting half of the unit of measurement (u) from the
lower-class limit and adding half of u to the upper-class limit of a class,
where u is the distance between two possible consecutive measures. It is usually taken as 1,
0.1, 0.01, 0.001, and so on. Then
$LCB_i = LCL_i - \frac{u}{2}$ and $UCB_i = UCL_i + \frac{u}{2}$.
For the data in the above example, consider the 2nd class 14-17. Since u = 1,
LCB = 14 − 0.5 = 13.5 and UCB = 17 + 0.5 = 17.5.
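The rule LCB = LCL − u/2 and UCB = UCL + u/2 can be sketched as a small function. The function name and the list of classes are illustrative assumptions; the class limits are the four classes discussed in the text, with u = 1.

```python
def class_boundaries(lcl, ucl, u=1):
    """Return (LCB, UCB) for a class with limits lcl and ucl.

    u is the unit of measurement, i.e. the distance between two
    possible consecutive measures (1, 0.1, 0.01, ...).
    """
    return lcl - u / 2, ucl + u / 2


# Class limits discussed in the text (scores recorded to the nearest 1).
limits = [(10, 13), (14, 17), (18, 21), (22, 25)]
boundaries = [class_boundaries(lcl, ucl) for lcl, ucl in limits]

for (lcl, ucl), (lcb, ucb) in zip(limits, boundaries):
    print(f"{lcl}-{ucl}: {lcb}-{ucb}")
```

Note that the upper boundary of each class coincides with the lower boundary of the next, so the classes leave no gaps.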
Class width (class size): The size or width of a class interval is the difference between the
upper- and lower-class boundaries and is referred to as the class width, class size or class
length.
In the above table, for the first class, the class width is 13.5 − 9.5 = 4, and the second class, 14-17,
has class width 17.5 − 13.5 = 4. In this table all classes have equal size, which is 4.
When all the classes are of the same size, the class width can also be obtained as the difference
between any two consecutive lower limits or upper limits (see the above table).
Class mark (class mid-point): is a value which lies mid-way between
the lower and upper limits of the class and is obtained by adding the lower- and upper-class
limits and dividing the sum by two, i.e.
$CM = \frac{LCL + UCL}{2} = \frac{LCB + UCB}{2}$
Note that when the class size is uniform in a distribution, after finding the class mark of the
first class the remaining are obtained by adding the class size. So, in the case of classes with
the same size, the class width can also be obtained as the difference between any two
consecutive class marks.
For the distribution of the above table, the class mark of the first class is 11.5, so the class
mark of the second class is 11.5 + 4 = 15.5, the class mark of the 3rd class is 15.5 + 4 = 19.5, and
that of the fourth class is 19.5 + 4 = 23.5. We can then have the following table.
Total 50
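The class marks quoted above can be reproduced with a short sketch. The names used here are our own; the class limits are those of the four classes discussed in the text.

```python
# Class limits (LCL, UCL) of the four classes discussed in the text.
limits = [(10, 13), (14, 17), (18, 21), (22, 25)]

# Class mark = (LCL + UCL) / 2 for each class.
class_marks = [(lcl + ucl) / 2 for lcl, ucl in limits]

# With equal class sizes, consecutive class marks differ by the class width.
widths = [b - a for a, b in zip(class_marks, class_marks[1:])]

print(class_marks)
print(widths)
```

The differences between consecutive class marks are all 4, which agrees with the class width obtained from the class boundaries.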
1. Determine the number of classes that will be used to group the data.
The number of classes should be neither so large as to destroy the advantage of classification,
nor be so small that the chief characteristic of the data is missed. The exact number of classes
to use depends upon the number of figures to be classified, the size of figures, the purpose that
data has to serve and the arbitrary preference of the analyst.
A small number of items to be classified justifies a small number of classes. For example, if we
classify 30 items into 20 classes, we would lose more than we gain from the classification. If,
on the other hand, we classify 15,000 items into 5 classes, we would probably give away too
much information.
So, in general the approximate number of classes depends upon the number of measurements
and the following rough information gives us a good hint.
Sturges’ Rule
To fix the number of classes (k), one can use the above method, use personal judgment
depending upon the nature of the investigation, or decide with the help of Sturges’ Rule, which states
that
$k = 1 + 3.322 \log_{10}(n)$,
where k is the number of classes and n is the total number of observations.
Generally, the number of classes should be between 5 and 20. That is, not less than 5 and not
greater than 20 classes should be used for any kind of distribution.
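Sturges’ Rule is easy to evaluate numerically. The sketch below uses the usual constant 3.322 (approximately 1/log10(2)) and rounds the result up to a whole number of classes; the rounding convention and the function name are assumptions made for illustration.

```python
import math


def sturges_classes(n):
    """Approximate number of classes k = 1 + 3.322 * log10(n), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(n))


# For a data set of 50 observations the rule suggests about 7 classes.
print(sturges_classes(50))
```

The value is only a guide: the analyst may still adjust it, keeping the number of classes between 5 and 20 as recommended above.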
Whenever possible, all classes should be of the same size. This facilitates the analysis of the
data and simplifies comparison between different classes.
A frequency distribution with equal class size can be presented pictorially with greater ease.
If the number of classes is known and if it is decided to use classes of equal size, the
determination of the size is relatively simple. Since the class size depends upon the number of
classes and the extent to which the values of the variable are spread or dispersed, the
following simple formula can be used.
$\text{Class width} = \frac{\text{Range}}{\text{Number of classes}}$, or $cw = \frac{R}{k}$
3. Determine the lower-class limit of the first class so that the smallest item falls in this class. The
remaining lower class limits are obtained using the following relations.
LCL2 = LCL1 + cw, LCL3 = LCL2 + cw, LCL4 = LCL3 + cw, …, LCLi+1 = LCLi + cw
4. Determine the upper-class limit of the first class using the formula
UCL1 = LCL1 + cw − u. The remaining upper-class limits are obtained using the following
relations.
UCL2 = UCL1 + cw, UCL3 = UCL2 + cw, …, UCLi+1 = UCLi + cw
5. Complete the continuous frequency distribution with the respective class frequencies.
41 50 69 77 88 92 40 51 67 75 87 94 93 86 72 62 53 49 57 67
70 85 97 95 83 79 68 52 44 44 55 64 75 83 74 60 56 42 56 69
70 42 64 52 63 60 59 61 65 78
Step 2. $cw = \frac{R}{k} = \frac{97 - 40}{7} = 8.142857143$
the construction of the distribution and further the analysis of the data,
LCL7 = 85 + 9 = 94.
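Class limits   Frequency   Class boundaries   Class mark
40 – 48        6           39.5 – 48.5        44
49 – 57        10          48.5 – 57.5        53
58 – 66        9           57.5 – 66.5        62
67 – 75        11          66.5 – 75.5        71
76 – 84        5           75.5 – 84.5        80
85 – 93        6           84.5 – 93.5        89
94 – 102       3           93.5 – 102.5       98
Total          50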
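The whole construction can be sketched in Python for the 50 values listed above. The sketch assumes k = 7 classes, a class width rounded up to 9 and u = 1, as in the worked steps, and it takes the smallest observation, 40, as the first lower-class limit (an assumption consistent with LCL7 = 94). All variable names are illustrative.

```python
import math

# The 50 observations listed in the example.
scores = [41, 50, 69, 77, 88, 92, 40, 51, 67, 75, 87, 94, 93, 86, 72, 62, 53, 49, 57, 67,
          70, 85, 97, 95, 83, 79, 68, 52, 44, 44, 55, 64, 75, 83, 74, 60, 56, 42, 56, 69,
          70, 42, 64, 52, 63, 60, 59, 61, 65, 78]

u = 1                                   # unit of measurement
k = 7                                   # number of classes used in the example
data_range = max(scores) - min(scores)  # R = 97 - 40 = 57
cw = math.ceil(data_range / k)          # 8.14... rounded up to 9

# Lower limits: LCL(i+1) = LCL(i) + cw; upper limits: UCL = LCL + cw - u.
classes = [(min(scores) + i * cw, min(scores) + i * cw + cw - u) for i in range(k)]

# Frequency of a class = number of observations between its limits.
frequencies = [sum(lcl <= x <= ucl for x in scores) for lcl, ucl in classes]

for (lcl, ucl), f in zip(classes, frequencies):
    print(f"{lcl}-{ucl}  f={f}  boundaries={lcl - u / 2}-{ucl + u / 2}  mark={(lcl + ucl) / 2}")
print("Total", sum(frequencies))
```

Because every observation falls in exactly one class, the frequencies must add up to the number of observations, 50.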
The relative frequency of a class shows the relative concentration of items in a given class
interval to the other classes of a frequency distribution.
$\text{Relative frequency of a class} = \frac{\text{class frequency}}{\text{total frequency}}$
Example 2.5: The following table shows an example of a relative frequency distribution.
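As an illustration of the formula, relative frequencies can be computed for the test-result distribution of Example 2.3 (frequencies 8, 15, 15, 7 and 5). This is a sketch with names of our own choosing, not a reproduction of the table of Example 2.5.

```python
# Classes and frequencies of the test results in Example 2.3.
classes = ["10-13", "14-17", "18-21", "22-25", "26-30"]
frequencies = [8, 15, 15, 7, 5]

total = sum(frequencies)                       # total frequency
relative = [f / total for f in frequencies]    # class frequency / total frequency

for label, f, rf in zip(classes, frequencies, relative):
    print(f"{label}  f={f}  relative frequency={rf:.2f}")
```

The relative frequencies always add up to 1 (or 100 when expressed as percentages), which is a convenient check on the arithmetic.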
The cumulative frequency of a value of a variable (or a class) is the sum of all the frequencies
preceding or succeeding that value (class), including the frequency of that value (class). There
are two types of cumulative frequency distributions, namely the “less than” cumulative
frequency distribution and the “more than” cumulative frequency distribution.
The less than cumulative frequency for any value of the variable (or class) is obtained by adding
successively the frequencies of all the preceding values (or classes), including the frequency of
the value (class) against which the totals are
written, provided the values (classes) are arranged in ascending order of magnitude. For a
grouped frequency distribution, it is the sum of all frequencies lying below the upper-class
boundary of each class.
Example 2.6: The table below shows the ‘less than’ cumulative frequency distribution of
marks of 70 students in a class.
Marks    Frequency    ‘Less than’ cumulative frequency
30-35    5            5
35-40    10           5+10=15
40-45    15           15+15=30
45-50    30           30+30=60
50-55    5            60+5=65
55-60    5            65+5=70
The above ‘less than’ cumulative frequency distribution can also be written as follows
Marks Frequency
Less than 30 0
Less than 35 5
Less than 40 15
Less than 45 30
Less than 50 60
Less than 55 65
Less than 60 70
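The ‘less than’ cumulative frequencies are simply running totals of the class frequencies. A minimal sketch for the marks of the 70 students, with variable names of our own choosing:

```python
from itertools import accumulate

# Upper-class boundaries and class frequencies of the marks distribution.
upper_boundaries = [35, 40, 45, 50, 55, 60]
frequencies = [5, 10, 15, 30, 5, 5]

# Running totals taken from the lowest class upwards.
less_than = list(accumulate(frequencies))

for boundary, cf in zip(upper_boundaries, less_than):
    print(f"Less than {boundary}: {cf}")
```

The last running total equals the total number of students, 70.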
The ‘more than’ cumulative frequency is obtained similarly by finding the cumulative totals
of frequencies starting from the highest value of the variable (class) to the lowest value
(class). Thus, in the above illustration the number of students with marks ‘more than 50’ is
5+5= 10, and ‘more than 40’ is 15+30+5+5=55 and so on. The complete ‘more than’ type
cumulative frequency distribution for this data is given below:
Marks    Frequency    ‘More than’ cumulative frequency
30-35 5 65+5=70
35-40 10 55+10=65
40-45 15 40+15=55
45-50 30 10+30=40
50-55 5 5+5=10
55-60 5 5
The above ‘more than’ C.F. Distribution can also be expressed in the following form:
Marks           Number of students
More than 30 70
More than 35 65
More than 40 55
More than 45 40
More than 50 10
More than 55 5
More than 60 0
Remark: In the ‘less than’ cumulative frequency distribution, the c.f. refers to the upper class
boundary of the corresponding class, and in the ‘more than’ cumulative frequency distribution,
the c.f. refers to the lower class boundary of the corresponding class.
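The two kinds of cumulative frequency can be checked with a short Python sketch. It uses the marks of the 70 students from Example 2.6; the function names are chosen here for illustration only.

```python
# Class frequencies for the marks of 70 students (Example 2.6).
classes = ["30-35", "35-40", "40-45", "45-50", "50-55", "55-60"]
freqs = [5, 10, 15, 30, 5, 5]

def less_than_cf(freqs):
    """Running totals of the frequencies from the lowest class upward."""
    totals, running = [], 0
    for f in freqs:
        running += f
        totals.append(running)
    return totals

def more_than_cf(freqs):
    """Running totals of the frequencies from the highest class downward."""
    return less_than_cf(freqs[::-1])[::-1]

less = less_than_cf(freqs)   # paired with the upper class boundaries
more = more_than_cf(freqs)   # paired with the lower class boundaries
for c, f, lt, mt in zip(classes, freqs, less, more):
    print(f"{c:>6} {f:>3} {lt:>3} {mt:>3}")
```

The last ‘less than’ total and the first ‘more than’ total both equal the total number of observations, which is a quick check on the arithmetic.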
Histogram
A histogram is a graphical display of the distribution of a data set. A histogram looks like a
vertical bar graph, except that the columns touch each other.
The given grouped data is plotted in the form of a series of rectangles. Class boundaries are
marked along the x-axis and the frequencies along the y- axis according to a suitable scale. If
all the classes are of the same size, the height of the rectangles can be taken to be numerically
equal to the class frequencies.
If on the other hand the size of the class intervals is not uniform, the height of the rectangles
can be adjusted by taking the “frequency density” of the corresponding classes as scale for the
vertical axis.
$\text{Frequency density} = \dfrac{\text{class frequency}}{\text{class width}}$
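As a small illustration of frequency density for unequal class widths, the sketch below divides each class frequency by its class width. The classes and frequencies are hypothetical and are not taken from this module.

```python
# Hypothetical classes with unequal widths:
# (lower class boundary, upper class boundary, frequency).
classes = [(0, 10, 8), (10, 20, 12), (20, 40, 10), (40, 80, 4)]

def frequency_density(lower, upper, frequency):
    """Height of the histogram rectangle: frequency per unit of class width."""
    return frequency / (upper - lower)

densities = [frequency_density(lo, hi, f) for lo, hi, f in classes]
for (lo, hi, f), d in zip(classes, densities):
    print(f"{lo}-{hi}: frequency {f}, density {d}")
```

With these heights, the area of each rectangle (density times width) equals the class frequency, so wide classes are not visually exaggerated.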
A histogram gives us an idea about the shape of the data distribution. It can indicate to us,
graphically, where the center of the data distribution lies. It will also reveal whether the
distribution is symmetric or skewed.
Class limits    fi    Class boundary
10-19 4 9.5-19.5
20-29 5 19.5-29.5
30-39 8 29.5-39.5
40-49 6 39.5-49.5
50-59 2 49.5-59.5
Solution:
[Figure: histogram of the frequency distribution above; x-axis: class boundary, y-axis: frequency]
Exercises:
What is a frequency distribution? What benefits does it offer in the summarization and
reporting of data values?
A stem and leaf plot is a way to represent a data set graphically by splitting each data value
into two parts: the leading part of the number is shown on the left side of the graph and is
called the “stem”, while the last digit is shown on the right and is called the “leaf”. Take a
long list of numbers and put them in order from smallest to largest. Draw a vertical line. Take
all but the last digit from each number and list them from top to bottom on the left, using each
only once. These are the “stems”. Now take the last digits and put them on the right side of the
line, in order, aligning them with the proper stems. These are the “leaves”.
Generally:
Graph them!
Example: An insurance company researcher conducted a survey on the number of car thefts
in a large city for a period of 30 days last summer. The raw data are shown. Construct a stem
and leaf plot by using classes 50–54, 55–59, 60–64, 65–69, 70–74, and 75–79.
52 62 51 50 69
58 77 66 53 57
75 56 55 67 73
79 59 68 65 72
57 51 63 69 75
65 53 78 66 55
Solution:
50, 51, 51, 52, 53, 53, 55, 55, 56, 57, 57, 58, 59, 62, 63,
65, 65, 66, 66, 67, 68, 69, 69, 72, 73, 75, 75, 77, 78, 79
(50, 51, 51, 52, 53, 53) (55, 55, 56, 57, 57, 58, 59) (62, 63) (65, 65, 66, 66, 67, 68, 69,
69) (72, 73) (75, 75, 77, 78, 79)
5| 011233
5| 5567789
6| 23
6| 55667899
7| 23
7| 55789
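The construction above (sort, group by class, then split each value into stem and leaf) can be sketched in Python. The function name and the way the two halves of each stem are keyed are choices made for this illustration.

```python
from collections import defaultdict

# Raw car-theft counts for the 30 days in the example.
thefts = [52, 62, 51, 50, 69, 58, 77, 66, 53, 57, 75, 56, 55, 67, 73,
          79, 59, 68, 65, 72, 57, 51, 63, 69, 75, 65, 53, 78, 66, 55]

def split_stem_and_leaf(values):
    """Two rows per stem: leaves 0-4 in the first row, leaves 5-9 in the second."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)          # e.g. 57 -> stem 5, leaf 7
        rows[(stem, leaf >= 5)].append(leaf)
    return [(stem, "".join(str(leaf) for leaf in leaves))
            for (stem, _), leaves in sorted(rows.items())]

plot = split_stem_and_leaf(thefts)
for stem, leaves in plot:
    print(f"{stem}| {leaves}")
```

Splitting each stem into two rows corresponds to the classes 50–54, 55–59, and so on, requested in the example.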
Note: When the data values are in the hundreds, such as 325, the stem is 32 and the leaf is 5.
For example, the stem and leaf plot for the data values 325, 327, 330, 332, 335, 341, 345, and
347 looks like this.
Stem | Leaf
32 | 5 7
33 | 0 2 5
34 | 1 5 7
Example 2: Consider the following data on the number of hamburgers sold by a fast-food
restaurant in each of 15 weeks.
1565 1852 1644 1766 1888 1912 2044 1812
1790 1679 2008 1852 1967 1954 1733
A stem-and-leaf display of these data follows.
Leaf unit = 10
15 | 6
16 | 4 7
17 | 3 6 9
18 | 1 5 5 8
19 | 1 5 6
20 | 0 4
Dot Plot
One of the simplest graphical summaries of data is a dot plot. A horizontal axis shows the
range for the data. Each data value is represented by a dot placed above the axis. The dot plot
displays each data value as a dot and allows us to readily see the shape of the distribution as
well as the high and low values.
Dear students, can you mention some diagrammatic and graphical methods of data
presentation? Do we use the same diagram and graph for different purposes, or do we have
different diagrams and graphs for different purposes? If you know them, compare them with what
is discussed below. We will discuss them in this section.
When the basis of classification is not quantitative, i.e., when the data are of an attribute
nature, statistical data can be presented diagrammatically using charts. The chart could be a bar
chart, pie-chart or pictogram, each of which has specific uses depending upon the nature of the
information to be depicted.
Bar charts
Bar charts are drawn almost in the same way as graphs. Data are presented by a series of bars,
the height of each bar showing the size of the observation represented.
There are four main types of bar charts serving different purposes. These are simple bar charts,
component bar charts, percentage component bar charts and multiple bar charts.
In a simple bar chart, each bar represents one and only one figure. A simple bar chart is usually
constructed to represent totals only.
Example 2.7: The following table shows the number of students attending four departments.
Solution:
[Figure: simple bar chart; x-axis: department (Math., Stat., Physics, Chemistry), y-axis: number of students]
The component bar chart gives the breakdown of an aggregate into the parts that constitute it
for a given year, place or sector. In this type of chart, it is possible to compare changes in
the parts as well as in the aggregate.
Example 2.8: The table and chart below show the revenue expenditure of a country on
education.
[Figure: component bar chart of expenditure on education; legend: Primary, Secondary, Higher Education; x-axis: 1978-80, 1980-81, 1981-82]
Here the interrelated component parts are shown as adjoining bars, coloured or marked
differently. This allows comparison between the different parts.
Example 2.9: The charts below show the revenue expenditure of a country in education
[Figure: bar chart of expenditure on education; legend: Primary, Secondary, Higher Education; x-axis: 1978-80, 1980-81, 1981-82]
A pie-chart is a circle divided by radial lines into sections or slices so that the area of each
section is proportional to the size of the amount represented. It is a simple descriptive display
of data that sum to a given total. A pie-chart is probably the most illustrative way of
displaying quantities as percentages of a given total. The total area of the pie represents 100
percent of the quantity of interest (the sum of the variable values in all categories), and each
slice denotes one category. Thus, a pie-chart indicates relative frequencies by slicing up a
circle into distinct sectors.
The sum of the angles at a point being 360°, the component parts of the data are expressed as
proportions of 360° and the sectors of the circle represent these parts. The angle corresponding
to each component is obtained by dividing the amount for that item by the total and
multiplying by 360°; the sectors are then drawn by means of a protractor.
In order to draw pie chart, it is convenient to form beforehand a table of percentages and the
corresponding angles to be drawn at the center of the circle.
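The table of percentages and angles can be prepared as in the following Python sketch. The expense categories and amounts are hypothetical, chosen only so that they add up to 1000 Birr; they are not the figures of any example in this module.

```python
# Hypothetical monthly expenses (in Birr) of a family; the items and
# amounts are illustrative assumptions that sum to 1000 Birr.
expenses = {"Food": 400, "Rent": 250, "Clothing": 200, "Other": 150}

def pie_table(amounts):
    """Return {item: (percentage, angle in degrees)} for a pie chart."""
    total = sum(amounts.values())
    return {item: (100 * a / total, 360 * a / total)
            for item, a in amounts.items()}

table = pie_table(expenses)
for item, (percent, angle) in table.items():
    print(f"{item:<10} {percent:5.1f}% {angle:6.1f} degrees")
```

The percentages should add up to 100 and the angles to 360°, which is a useful check before drawing the sectors.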
Example 2.10: The following table shows the monthly expenses of a family with an income of
1000 Birr.
Solution:
[Figure: pie chart of the family's monthly expenses; slice labels include 25% and 20%]
pictorial form so that one can readily identify these characteristics and can compare one
frequency distribution with another.
The histogram, frequency polygon and cumulative frequency curve are common ways of
representing a frequency distribution graphically.
A frequency polygon is a line chart of a frequency distribution in which either the values of a
discrete variable or the class marks of the classes are plotted against the frequencies, and the
plotted points are joined by straight lines.
It is thus a graphic presentation tool that may be used as an alternative to the histogram. For a
large number of classes, a frequency polygon is preferable.
For a frequency distribution where the class intervals are equal, the area of the frequency
polygon is equal to the area of the histogram.
Solution:
Class limits    Frequency    Class marks
10-19           4            14.5
20-29           5            24.5
30-39           8            34.5
40-49           6            44.5
50-59           2            54.5
[Figure: frequency polygon; x-axis: class mark (4.5 to 64.5), y-axis: frequency]
Remark: We close the polygon by joining it to imaginary class marks, with zero frequency, to the
left and to the right of the extreme class marks.
The ogive curve can be traced either on less than basis or more than basis.
a) ‘Less than Ogive’: Upper class boundaries are plotted against the ‘less than’ cumulative
frequencies.
b) ‘More than’ Ogive: Lower class boundaries are plotted against the ‘more than’ cumulative
frequencies.
(b) the ‘More than’ ogive for the above frequency distribution.
Solution:
Class limits    Frequency    ‘Less than’ c.f.    ‘More than’ c.f.
10-19           4            4                   25
20-29           5            9                   21
30-39           8            17                  16
40-49           6            23                  8
50-59           2            25                  2
[Figure: ‘less than’ ogive; x-axis: upper class boundary, y-axis: cumulative frequency]
[Figure: ‘more than’ ogive; x-axis: lower class boundary, y-axis: cumulative frequency]
Depending on the kinds of numerical data we have we may have the scatter plot of the
following forms.
Example: The local ice cream shop keeps track of how much ice cream they sell versus the
noon temperature on that day. Here are their figures for the last 12 days:
Ice Cream Sales vs Temperature
14.2° $215
16.4° $325
11.9° $185
15.2° $332
18.5° $406
22.1° $522
19.4° $412
25.1° $614
23.4° $544
18.1° $421
22.6° $445
17.2° $408
A statistical table is an orderly and systematic presentation of numerical data in rows and
columns. Rows (stubs) are horizontal and columns (captions) are vertical arrangements. The
use of tables for organizing data involves grouping the data into mutually exclusive categories
of the variables and counting the number of occurrences (frequency) in each category. These
mutually exclusive categories, for qualitative variables, are naturally occurring groupings.
In the case of quantitative variables with many distinct values, such as weight or height
measurements, the groups are formed by amalgamating continuous values into class intervals.
There are, however, variables which have frequently used standard classes. One such variable,
which has wide application in demographic surveys, is age.
The simple frequency table is used when the individual observations involve only a single
variable, whereas the cross tabulation is used to obtain the frequency distribution of one
variable by the subsets of another variable. In addition to the frequency counts, the relative
frequency is used to depict the distributional pattern of the data clearly. It expresses each
frequency count as a percentage of the total.
On the other hand, in cross tabulated frequency distributions where there are row and column
totals, the decision for the denominator is based on the variable of interest to be compared
over the subset of the other variable.
Although there are no hard and fast rules to follow, the following general principles should be
addressed in constructing tables.
Chapter Summary
The first step in any statistical investigation is collecting relevant data. After the data have
been collected, they have to be organized and presented in a systematic manner. Common methods
of presenting numerical data are frequency distributions and diagrammatic and graphical
methods. Frequency distributions are of different types: ungrouped or discrete,
qualitative, grouped or continuous, relative, and cumulative frequency distributions.
Based on the frequency distributions mentioned above we can present our data using different
diagrams and graphs. The different types of diagrams discussed are bar charts (simple,
component, percentage component and multiple) and the pie chart. We have different types of
graphs for the presentation of data, such as the histogram, frequency polygon, cumulative
frequency curves (ogive curves), line graph and vertical line graph.
Check list
Dear students, have you made yourself familiar with the points mentioned below?
If not, make sure that you understand them well by going back and referring to the
units in which they are found.
Exercises
1. For 75 employees of a large department store, the following distribution for years of service
was obtained. Construct a histogram, frequency polygon, and ogive for the data.
2. The salaries (in millions of dollars) for 31 NFL teams for a specific season are given in this
frequency distribution.
Terrible 10
0 2 5 0 1 4 1 0 2 1
5 0 1 3 0 0 2 1 3 1
1 4 0 2 4 1 2 4 0 4
3 5 0 1 3 6 4 2 0 2
0 2 3 0 4 2 5 1 1 2
2 1 6 5 0 3 3 0 0 4
a) Group these data into a frequency distribution showing how often each of the values occurs
and draw a bar chart.
b) Convert the distribution obtained in (a) above into a cumulative “or more” distribution and
draw its ogive.
8. In a 2-week study of the productivity of workers, the following data were obtained on the
total number of acceptable pieces which 100 workers produced:
65 36 49 84 79 56 28 43 67 36
43 78 37 40 68 72 55 62 22 82
88 50 60 56 57 46 39 57 73 65
59 48 76 74 70 51 40 75 56 45
35 62 52 63 32 80 64 53 74 34
76 60 48 55 51 54 45 44 35 51
21 35 61 45 33 61 77 60 85 68
45 53 34 67 42 69 52 68 52 47
63 65 55 61 73 50 53 59 41 54
41 74 82 58 26 35 47 50 38 70
families.
Below 50.5 3
Below 55.5 10
Below 60.5 16
Below 65.5 20
Below 70.5 22
Below 75.5 25
11. The following table shows the type of cars manufactured by a certain company during 1972-
1975.
Years
Cars 1972 1973 1974 1975
Toyota 400 300 380 450
Construct
X Y
351.4 18.4
291.3 15.8
325.0 19.3
422.7 22.5
238.1 16.0
514.5 24.
a. Draw a scatter diagram representing these data.
b. Does there appear to be any relationship between the variables? If so, is the relationship
direct or inverse?
13. A recent study showed that a typical American car owner incurs the following expenses,
on an average, when he leases a car for three years.
Gasoline 1,350
Insurance 1,800
Maintenance 1,350
Dear students, in the previous chapter you learned how to present collected data using
appropriate methods of data presentation such as tables, charts and graphs, so that the data can
be visualized in a simple and easily understandable form. In this chapter, you will learn how to
represent a whole set of numerical or qualitative data by a central (average) observation, using
measures of central tendency such as the mean, mode and median, and how to describe the spread
of the values around the representative average using measures of variation such as the variance
and standard deviation. In addition, you will learn about quantiles such as quartiles, deciles
and percentiles, which divide ordered data into equal parts.
Introduction
In the previous chapter, we discussed the techniques of classification and tabulation, which
help in summarizing the collected data and presenting them in the form of a frequency
distribution.
Now suppose the students from two or more classes appeared in the examination and we wish
to compare the performance of the classes in the examination or wish to compare the
performance of the same class after some coaching over a period of time. When making such
comparisons, it is not practicable to compare the full frequency distributions of marks,
however compactly these may be presented. Therefore, for such statistical analysis, we need
a single representative value that describes the entire mass of data given in the frequency
distribution. This single representative value is called the central value, measure of location or
an average around which individual values of a series cluster. This central value or an average
enables us to get a gist of the entire mass of data, and its value lies somewhere in the middle
of the two extremes of the given observations. For this reason, such a central value or an
average is frequently called a measure of central tendency.
From the above discussion, it should be clear to you that the concept of a measure of central
tendency is concerned only with quantitative variables and is undefined for qualitative
variables as these are immeasurable on a scale. The first step in looking at data is to describe
the data at hand in some concise way. In smaller studies this step can be accomplished by
listing each data point. In practice, however, this procedure is tedious or impossible and, even
if it were possible, would not give an overall picture of what the data look like. When we
want to make comparisons between groups of numbers, it is good to have a single value that is
considered to be a good representative of each group. This single value is called the average
of the group. Averages are also called measures of central tendency, and they tell us where the
center of the distribution of data is located on the scale we are using. There are several such
measures, but here we shall discuss the most commonly used measures of central tendency.
These include the mean, median and mode. There are also other measures of location
(sometimes called measures of non-central location) such as quartiles, deciles and percentiles.
Objectives
Since the number of sample points is frequently large and it is easy to lose track of the overall
picture by looking at all the data at once, the data must be summarized as briefly as possible.
Thus, the most important objective of data analysis is to determine a single value for the entire
mass of data. In addition to this the following are some objectives of measures of central
tendency.
Before attempting the measures of central tendency, let’s see some of the summation notation
that is used frequently.
expression like this very often, so mathematicians have developed a shorthand notation to
represent a sum of scores, called the Summation Notation.
Notation: Σ, read as sigma (the Greek capital letter for S), means “the sum of”. Suppose n
values of a variable are denoted as $X_1, X_2, X_3, \dots, X_n$; then
$\sum_{i=1}^{n} X_i = X_1 + X_2 + X_3 + \dots + X_n$,
where the subscript i ranges from 1 up to n and is used to identify the position of an element.
Example: Suppose the following were the scores made on the first assignment for Stat 173 by five
first-year Sport Science summer students: 5, 7, 7, 6, and 8. In this example set of five
numbers, where n = 5, the summation could be written:
$\sum_{i=1}^{5} X_i = X_1 + X_2 + X_3 + X_4 + X_5 = 5 + 7 + 7 + 6 + 8 = 33$.
Properties of Summation
1. $\sum (X_i \pm Y_i) = \sum X_i \pm \sum Y_i$, where the number of X values = the number of Y values.
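Summation notation maps directly onto a running total. The short Python sketch below uses the five assignment scores from the example above, together with a hypothetical second list `y` of the same length, to check numerically that the sum of the pairwise sums equals the sum of the separate totals.

```python
x = [5, 7, 7, 6, 8]   # assignment scores from the example above
y = [1, 2, 3, 4, 5]   # hypothetical second variable with the same number of values

sum_x = sum(x)                                # sigma X_i = 5 + 7 + 7 + 6 + 8
lhs = sum(xi + yi for xi, yi in zip(x, y))    # sigma (X_i + Y_i)
rhs = sum(x) + sum(y)                         # sigma X_i + sigma Y_i

print(sum_x, lhs, rhs)
```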
If an average is a good representative of the data, it is said to be a typical average; if an
average is not a good representative (only a theoretical value), it is said to be a descriptive
average.
There are several different measures of central tendency; each has its own advantages and
disadvantages. Among those:
The choice among these averages depends upon which best fits the property under discussion.
3.3.1. Mean
The various methods of determining the actual value at which the data tend to concentrate are
called measures of central tendency or averages. Hence, an average is a value which
tends to sum up or describe the mass of the data. There are four types of mean, each
suitable for a particular type of data. These are the arithmetic mean, weighted arithmetic mean,
geometric mean and harmonic mean.
The Arithmetic Mean (or simply the mean) is the most popular, most widely used and
best-understood measure of central tendency for a set of quantitative data. The
arithmetic mean of a set of observations is defined as the sum of the values of all observations
in a series divided by the number of items in the series. Suppose $X_1, X_2, X_3, \dots, X_n$ are n
observed values in a sample of size n from a population of size N, n < N; then the arithmetic
mean for ungrouped data of the sample, denoted by $\bar{X}$, is given
as: $\bar{X} = \dfrac{\sum_{i=1}^{n} X_i}{n} = \dfrac{X_1 + X_2 + \dots + X_n}{n}$.
population.
Example: Suppose the sample consists of the birth weights (in grams) of all live-born infants at a
private hospital in a certain city during a 1-week period. The sample values are 3265, 3323, 2581,
2759, 3260, 3649, 2841, 3248, 3245, 3200, 3609, 3314, 3484, 3031, 2838,
3101, 4146, 2069, 3541 and 2834. Find the arithmetic mean of the sample birth
weights.
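The calculation for this example can be checked with a few lines of Python: the 20 birth weights add up to 63338 grams, so the sample mean is 63338/20 = 3166.9 grams. The function name is an illustrative choice.

```python
# Birth weights (in grams) of the 20 infants in the example.
birth_weights = [3265, 3323, 2581, 2759, 3260, 3649, 2841, 3248, 3245, 3200,
                 3609, 3314, 3484, 3031, 2838, 3101, 4146, 2069, 3541, 2834]

def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

mean_weight = arithmetic_mean(birth_weights)
print(f"n = {len(birth_weights)}, total = {sum(birth_weights)}, mean = {mean_weight:.1f} g")
```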
If X is a variable having values $X_1, X_2, \dots, X_m$ occurring with frequencies $f_1, f_2, \dots, f_m$
respectively, then its arithmetic mean is given by: $\bar{X} = \dfrac{\sum_{i=1}^{m} f_i X_i}{\sum_{i=1}^{m} f_i}$
Solution: $\bar{X} = \dfrac{\sum f_i X_i}{\sum f_i} = 4$.
This method is applicable where the entire range of observations has been grouped into a
continuous frequency distribution (grouped frequency distribution). The value or score (X) of
each observation is assumed to be identical with the mid-point ($m_i$) of the class interval to
which it belongs. In such cases the mean of the distribution is computed as:
$\bar{X} = \dfrac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i}$,
where
k is the number of classes,
$m_i$ is the midpoint of the ith class and
$f_i$ is the ith class frequency.
In order to compute the mean of grouped data we should consider the following:
Example: Calculate the mean for grouped data on the amount of time (in hours) that 80
college students devoted to leisure (vacation) activities during a typical school week given
below:
10 – 14 8
15 – 19 28
20 – 24 27
25 – 29 12
30 – 34 3
35 – 39 1
40 – 44 1
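Solution: the class marks of the distribution are: 12, 17, 22, 27, 32, 37, 42.
$\bar{X} = \dfrac{\sum f_i m_i}{\sum f_i} = \dfrac{8(12) + 28(17) + 27(22) + 12(27) + 3(32) + 1(37) + 1(42)}{80} = \dfrac{1665}{80} \approx 20.8$ hours.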
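The grouped-data mean for this example can be reproduced step by step in Python. The sketch below takes each class mark as the midpoint of the class limits and weights it by the class frequency; the total of the products is 1665, so the mean is 1665/80 = 20.8125 hours.

```python
# Leisure-time distribution of the 80 students: class limits and frequencies.
class_limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
freqs = [8, 28, 27, 12, 3, 1, 1]

# Class mark = midpoint of the lower and upper class limits.
class_marks = [(lower + upper) / 2 for lower, upper in class_limits]

total_fm = sum(f * m for f, m in zip(freqs, class_marks))
grouped_mean = total_fm / sum(freqs)
print(class_marks, total_fm, grouped_mean)
```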
i) Characteristics:
3) If we have the means $\bar{X}_1$ and $\bar{X}_2$ of two groups having the same unit of measurement
of a variable, based on $n_1$ and $n_2$ observations respectively, we can compute the mean of the
combined groups ($\bar{X}_c$), which is given by: $\bar{X}_c = \dfrac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$. This is true for more than two groups.
Example: If the mean mark of one class of 50 students is 30 and the mean mark of another
class of 100 students is 40, what is the mean of all 150 students?
4) If a wrong figure has been used when calculating the mean, then the correct mean can be
obtained without repeating the whole process using:
correct mean = wrong mean + (correct value − wrong value)/n.
Example: The average weight of 10 students was calculated to be 65 kg. Later it was discovered
that one weight was misread as 40 kg instead of 80 kg. Calculate the correct average weight.
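Both examples above can be worked through in a few lines of Python. The sketch assumes the usual approach: the combined mean weights each group mean by its group size, and the corrected mean replaces the wrong value in the total before dividing again. The function names are illustrative.

```python
def combined_mean(sizes, means):
    """Mean of several groups pooled together, weighting each mean by its group size."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

def corrected_mean(wrong_mean, n, wrong_value, correct_value):
    """Rebuild the total from the wrong mean, swap the misread value, and divide again."""
    return (wrong_mean * n - wrong_value + correct_value) / n

# Two classes of 50 and 100 students with mean marks 30 and 40.
all_students = combined_mean([50, 100], [30, 40])

# Ten students, mean weight 65 kg, with one weight misread as 40 kg instead of 80 kg.
correct_weight = corrected_mean(65, 10, wrong_value=40, correct_value=80)

print(round(all_students, 2), correct_weight)
```

Under these assumptions the combined mean is 5500/150, about 36.67, and the corrected average weight is 690/10 = 69 kg.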
a) If a constant k is added to (subtracted from) every observation, then the new mean will be the
old mean plus (minus) k.
b) If every observation is multiplied by a constant k, then the new mean will be k × old mean.
c) If every observation is divided by a nonzero constant k, then the new mean will be (1/k) × old mean.
set of capsules of another drug are obtained by the linear transformation $Y_i = 2X_i - 0.5$, i = 1,
a. If 10 is added to each of the numbers in the set, then what will be the mean of the new set?
b. If each of the numbers in the set is multiplied by -5, then what will be the mean of the new
set?
ii) Advantages:
iii) Disadvantages:
Weighted Mean
In the computation of the mean we have given equal importance to each observation. However, when
averaging quantities, it is often necessary to account for the fact that not all of them are
equally important in the phenomenon being described. In order to give the quantities being
averaged their proper degree of importance, it is necessary to assign them relative importance
called weights, and then calculate a weighted mean.
In general, the weighted mean $\bar{X}_w$ of a set of values $X_1, X_2, \dots, X_n$, whose relative
importance is expressed numerically by a corresponding set of weights $W_1, W_2, \dots, W_n$, is given by:
$\bar{X}_w = \dfrac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$
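A minimal Python sketch of the weighted mean follows, computed as the sum of weight times value divided by the sum of the weights. The marks and credit-hour weights are hypothetical and serve only to show how the weighted and unweighted means can differ.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical marks in three courses and their credit hours (the weights).
marks = [70, 80, 90]
credit_hours = [3, 2, 1]

weighted = weighted_mean(marks, credit_hours)   # (210 + 160 + 90) / 6
unweighted = sum(marks) / len(marks)            # every course counted equally

print(round(weighted, 2), unweighted)
```

Here the course with the most credit hours has the lowest mark, so the weighted mean falls below the simple mean.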
Solutions:
Solution:
If we don't consider the weights, the average mark of the student will be 123, which is
wrong.
Geometric Mean
If the observed values are measured as ratios, proportions or percentages and the series of
observations contains one or more unusually large values, the geometric mean gives a better
measure of central tendency than the other means.
It is obtained by taking the nth root of the product of the n values, i.e., if the values of the
observations are denoted by $X_1, X_2, \dots, X_n$, then
$GM = \sqrt[n]{X_1 \times X_2 \times \dots \times X_n}$.
Whenever the frequency distribution is grouped (continuous), the class marks ($m_i$) of the class
intervals are considered as $X_i$ and the second formula of the above can be used.
But if n is a large number, computing the nth root of the product of these values
by simple arithmetic is tedious work. To facilitate the computation of the geometric mean we
make use of logarithms. The above formula when reduced to its logarithmic form will be:
$\log GM = \dfrac{1}{n} \sum_{i=1}^{n} \log X_i$
The logarithm of the geometric mean is equal to the arithmetic mean of the logarithms of
the individual values. The actual process involves obtaining the logarithm of each value, adding
them and dividing the sum by the number of observations. The quotient so obtained is then looked
up in the tables of anti-logarithms, which will give us the geometric mean.
Example: The geometric mean may be calculated for the following parasite counts per 100
fields of thick films, taken from 42 samples: 7 8 3 14 2 1 440 15 52
6 2 1 1 25 12 6 9 2 1 6 7 3 4 70 20 200 2 50
21 15 10 120 8 4 70 3 1 103 20 90 1 237
The mean of the base-10 logarithms of these counts is 0.9999. The anti-log of 0.9999 is
approximately 9.998 ≈ 10, and this is the required geometric mean.
By contrast, the arithmetic mean, which is inflated by the high values of 440, 237 and 200, is
39.8.
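The logarithmic method for this example can be reproduced in Python, with the standard `math.log10` function replacing the log tables. The sketch averages the base-10 logarithms of the 42 counts and then takes the anti-log, and it also computes the arithmetic mean for comparison.

```python
import math

# Parasite counts per 100 fields for the 42 samples in the example.
counts = [7, 8, 3, 14, 2, 1, 440, 15, 52,
          6, 2, 1, 1, 25, 12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50,
          21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237]

def geometric_mean(values):
    """Anti-log of the arithmetic mean of the base-10 logarithms (positive values only)."""
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log

gm = geometric_mean(counts)
am = sum(counts) / len(counts)
print(f"geometric mean = {gm:.2f}, arithmetic mean = {am:.1f}")
```

The geometric mean comes out close to 10, far below the arithmetic mean of about 39.8, which shows how much less it is pulled up by the few very large counts.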
i) Characteristics
1. It is a calculated value and depends upon the size of all the items.
2. It gives less importance to extreme items than does the arithmetic mean.
3. For any series of positive items it is never larger than the arithmetic mean; the two are equal only when all the items are equal.
ii) Advantages
1) Since it is less affected by extreme values, it is preferable to the arithmetic mean when such values are present.
iii) Disadvantages
2) It cannot be determined if there is any negative value in the distribution, or where one of the
items has a zero value.
The harmonic mean is a suitable measure of central tendency when the data pertain to speeds,
rates and time.
The harmonic mean is defined as the reciprocal of the mean of the reciprocals of a series of
observations. That is, let $X_1, X_2, \dots, X_n$ be the values of a set of observations; then the
harmonic mean is $HM = \dfrac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$.
When the observed values $X_1, X_2, \dots, X_k$ have the corresponding frequencies $f_1, f_2, \dots, f_k$,
$HM = \dfrac{\sum_{i=1}^{k} f_i}{\sum_{i=1}^{k} \frac{f_i}{X_i}}$.
When the frequency distribution is grouped (continuous), the class marks ($m_i$) of the class
intervals are considered as $X_i$ in the frequency formula above.
Example: Milk is sold at a place at the rates of 1.8, 2, 2.25 and 2.5 birr per liter in four
different months. Assuming equal amount of money is spent on milk by a family in the four
months; find the average price paid per liter using harmonic mean.
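The milk example can be checked numerically with the Python sketch below, which uses the definition of the harmonic mean as the number of observations divided by the sum of their reciprocals. Because the same amount of money is spent each month, this is the appropriate average price per liter.

```python
def harmonic_mean(values):
    """Number of observations divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)

# Price of milk (birr per liter) in the four months.
prices = [1.8, 2, 2.25, 2.5]

hm = harmonic_mean(prices)            # 4 / (1/1.8 + 1/2 + 1/2.25 + 1/2.5) = 4 / 1.9
am = sum(prices) / len(prices)        # simple average of the four prices
print(f"harmonic mean = {hm:.3f} birr per liter, arithmetic mean = {am:.4f}")
```

The harmonic mean, about 2.105 birr per liter, is lower than the simple average of the four prices, because more liters are bought in the months when milk is cheaper.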
Note that if all the observations are positive, the three means satisfy the relationship
$AM \geq GM \geq HM$, and the three means are equal if all the
observations are equal.
3.3.2. Mode
The mode is the value of the observation that occurs with the greatest frequency. A particular
disadvantage is that, with a small number of observations, there may be no mode. In addition,
sometimes, there may be more than one mode such as when dealing with a bimodal (two-
peak) distribution. It is even less amenable (responsive) to mathematical treatment than the
median.
Example: Find the mode for the following data: (a) 22, 66, 69, 70, 73 (no modal value). (b)
1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5 (modal value = 3.0 kg). (c) 10, 10, 9, 9, 8, 12, 15, 5
(modal values = 9 and 10). Hence, it is possible for a frequency distribution to have more than
one mode.
Note: Distributions with one mode are called unimodal, those with two modes are called
bimodal, and those with more than two modes are called multimodal.
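The three data sets in the example above can be checked with a small Python sketch. It counts how often each value occurs and returns every value that attains the highest count; as a convention assumed for this sketch, it reports no mode when every value occurs only once.

```python
from collections import Counter

def modes(values):
    """All values with the highest frequency; empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:          # every value occurs once: no modal value
        return []
    return sorted(v for v, c in counts.items() if c == highest)

data_a = [22, 66, 69, 70, 73]
data_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]
data_c = [10, 10, 9, 9, 8, 12, 15, 5]

print(modes(data_a), modes(data_b), modes(data_c))
```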
To find the Modal value for grouped (continuous) frequency distribution, first find the modal
class which is the class that contains the mode and it is the class with the highest frequency.
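Then to compute the modal value for grouped data, we use the formula:
$\hat{X} = L + \dfrac{\Delta_1}{\Delta_1 + \Delta_2} \times w$, where
L = the lower class boundary of the modal class,
$\Delta_1$ = the frequency of the modal class minus the frequency of the class preceding it,
$\Delta_2$ = the frequency of the modal class minus the frequency of the class following it, and
w = the class width.
Note: The modal class is the class interval with the highest frequency.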
Example: consider the above data set for the amount of time (in hours) that 80 college
students devoted to leisure activities during a typical school week and calculate the modal
value of the data.
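Solution: the modal class interval is the second class (15 – 19), as its frequency (= 28) is the
highest among all the classes. Therefore, the lower class boundary of the modal class is 14.5 and
the class width is 5. The modal frequency exceeds the frequency of the preceding class by
28 − 8 = 20 and that of the following class by 28 − 27 = 1, so that
mode = 14.5 + (20/(20 + 1)) × 5 ≈ 19.26 hours.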
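The modal value of the leisure-time data can be computed with the Python sketch below. It assumes the usual interpolation formula, mode = L + d1/(d1 + d2) × w, where L is the lower class boundary of the modal class, d1 and d2 are the differences between the modal frequency and the frequencies of the preceding and following classes, and w is the class width; a missing neighbour is treated as having frequency 0.

```python
# Leisure-time distribution: lower class boundaries, frequencies and class width.
lower_boundaries = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
freqs = [8, 28, 27, 12, 3, 1, 1]
width = 5

def grouped_mode(lower_boundaries, freqs, width):
    """Interpolated mode of a grouped frequency distribution with equal class widths."""
    i = freqs.index(max(freqs))                            # position of the modal class
    before = freqs[i - 1] if i > 0 else 0                  # frequency of the preceding class
    after = freqs[i + 1] if i < len(freqs) - 1 else 0      # frequency of the following class
    d1 = freqs[i] - before
    d2 = freqs[i] - after
    return lower_boundaries[i] + d1 / (d1 + d2) * width

mode_hours = grouped_mode(lower_boundaries, freqs, width)
print(round(mode_hours, 2))
```

For these data the modal class is 15 – 19, with d1 = 20 and d2 = 1, so the mode is 14.5 + (20/21) × 5, about 19.26 hours.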
i) Characteristics
1. It is an average of position
2. It is not affected by extreme values
3. It is the most typical value of the distribution
ii) Advantages
iii) Disadvantages
Note: being the point of maximum density, mode is especially useful in finding the most popular
size in studies relating to marketing, trade, business, and industry. It is the appropriate average
to be used to find the ideal size.
3.3.3. Median
An alternative measure of location, perhaps second in popularity to the arithmetic mean, is the
median. In a distribution, the median is the value of the variable which divides it into two equal
halves. In an ordered series of data, the median is the observation lying exactly in the middle of
the series. It is the middlemost value in the sense that the number of values less than the
median is equal to the number of values greater than it.
Suppose there are n observations in a sample. If these observations are ordered from
smallest to largest, then the sample median for ungrouped data is defined as:
the $\left(\frac{n+1}{2}\right)$th ordered observation if n is odd, and
the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th ordered observations if n is even.
The rationale for these definitions is to ensure an equal number of sample points on both sides
of the sample median. The median is defined differently when n is even and odd because it is
impossible to achieve this goal with one uniform definition. For samples with an odd sample
size, there is a unique central point; for example, for a sample of size 7, the fourth largest
point is the central point in the sense that 3 points are smaller than it and 3 points are larger
than it. For samples with an even size, there is no unique central point and the middle 2 values
must be averaged. Thus, for a sample of size 8, the fourth and the fifth largest points would be
averaged to obtain the median, since neither is the central point.
(a) 5, 2, 8, 9, 4. (b) 2, 1, 8, 3, 5, 8.
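The odd and even cases of the median can be illustrated with the two data sets (a) and (b) above. The Python sketch below orders the observations first and then either picks the middle value or averages the two middle values.

```python
def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                 # odd n: unique central point
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle two

data_a = [5, 2, 8, 9, 4]        # ordered: 2, 4, 5, 8, 9
data_b = [2, 1, 8, 3, 5, 8]     # ordered: 1, 2, 3, 5, 8, 8

print(median(data_a), median(data_b))
```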
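For grouped data, the median is computed as:
$\tilde{X} = L + \left(\dfrac{\frac{n}{2} - CF}{f}\right) \times w$, where
L = the lower class boundary of the median class,
½n = the number of observations to be counted off from one end of the distribution to reach the
median,
CF = the ‘less than’ cumulative frequency of the class immediately preceding the median class,
f = the frequency of the median class and
w = the class width of the median class.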
NB: The median of grouped data is also the value of X on the horizontal axis which
corresponds to the intersection point of the ‘less than’ ogive and the ‘more than’ ogive (they
intersect at a cumulative frequency of n/2) if they are drawn on the same plane.
Example: consider the above data set for the amount of time (in hours) that 80 college
students devoted to leisure activities during a typical school week and calculate the median of
the data.
Solution: n/2 = 80/2 =40. The class interval that contains the 40th observation from the less
than cumulative distribution is the third-class interval.
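The remaining arithmetic for this example can be sketched in Python. The sketch assumes the usual interpolation within the median class: the lower class boundary plus the fraction (n/2 − CF)/f of the class width, where CF is the cumulative frequency before the median class and f is its frequency.

```python
# Leisure-time distribution: lower class boundaries, frequencies and class width.
lower_boundaries = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
freqs = [8, 28, 27, 12, 3, 1, 1]
width = 5

half = sum(freqs) / 2      # n/2 = 40 observations to count off
cumulative = 0             # 'less than' cumulative frequency before the current class
for lower, f in zip(lower_boundaries, freqs):
    if cumulative + f >= half:     # this class contains the (n/2)th observation
        median_hours = lower + (half - cumulative) / f * width
        break
    cumulative += f

print(lower, cumulative, f, round(median_hours, 2))
```

With the median class 20 – 24 (lower boundary 19.5, CF = 36, f = 27), this gives 19.5 + (4/27) × 5, about 20.24 hours.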
i) Characteristics
1) It is an average of position.
ii) Advantages
3) The median may be located even when the data are incomplete, e.g, when the class intervals
are irregular and the final classes have open ends.
iii) Disadvantages
1. The median is not as well suited to algebraic treatment as the arithmetic, geometric and
harmonic means.
2. It is not as generally familiar as the arithmetic mean.
Selecting among the mean, the median and the mode is based on the following
principles (these are not hard and fast rules):
1. Is the data categorical? If yes, use the mode. If no, go to 2.
2. Is the total of interest? If yes, use the mean. If no, go to 3.
3. Is the distribution skewed? If yes, use the median. If no, use the mean.
In the case of a symmetrical distribution, the mean, median and mode coincide. However, for a
moderately asymmetrical (nonsymmetrical) distribution, the mean and mode usually lie at the
two ends and the median lies between them, and they have the following important empirical
relationship: mean – mode = 3(mean – median); thus, mode = 3 × median – 2 × mean.
Example: In a moderately asymmetrical distribution, the mean and the median are 20 and 25
respectively. Find the mode of the distribution.
Solution: mean – mode = 3(mean – median) => mode = 3 × median – 2 × mean = 3 × 25 – 2 × 20 =
35.
When a distribution is arranged in order of magnitude of items, the median is the value of the
middle term. Other measures that depend upon their positions in the distribution, namely
quartiles, deciles and percentiles, are collectively called quantiles.
Quartiles: Quartiles are measures that divide the frequency distribution into four equal
parts. The values of the variable corresponding to these divisions are denoted $Q_1$, $Q_2$ and $Q_3$,
often called the first, the second and the third quartile respectively.
$Q_1$ is a value which has 25% of the items less than or equal to it. Similarly, $Q_2$ has 50% of the
items with values less than or equal to it and $Q_3$ has 75% of the items with values less than or
equal to it.
The Kth quartile Qk for ungrouped data is the value of the item which is the
The three quartiles for grouped data can be computed as follows:
Calculate kn/4 and search for the minimum cumulative frequency which is greater than or
equal to kn/4, k = 1, 2, 3.
The class corresponding to this cumulative frequency is the kth quartile class. This is the class
where Qk lies. Then
$Q_k = L + \frac{C}{f}\left(\frac{kn}{4} - CF\right)$
where
L = the lower class boundary of the kth quartile class
CF = the less than cumulative frequency corresponding to the class immediately preceding the
kth quartile class
C = the class width of the kth quartile class and f = the frequency of the kth quartile class
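The grouped-data quartile rule can be sketched in Python as below. This is a minimal illustration, not the module's own worked solution: it assumes the usual interpolation formula Qk = L + (C/f)(kn/4 − CF), and it borrows the class table (40-44, 45-49, ..., 70-74) from the grouped variance example later in this chapter, with class boundaries 39.5, 44.5, ... and class width 5. The function name `grouped_quartile` is hypothetical.

```python
def grouped_quartile(k, lower_boundaries, frequencies, width):
    """kth quartile of grouped data: L + (C / f) * (k*n/4 - CF)."""
    n = sum(frequencies)
    target = k * n / 4
    cumulative = 0  # less-than cumulative frequency of the preceding classes
    for lower, f in zip(lower_boundaries, frequencies):
        if cumulative + f >= target:
            # This is the kth quartile class: interpolate inside it.
            return lower + (width / f) * (target - cumulative)
        cumulative += f
    raise ValueError("k must be 1, 2 or 3")

# Classes 40-44, 45-49, ..., 70-74 (lower class boundaries, class width 5).
lower_boundaries = [39.5, 44.5, 49.5, 54.5, 59.5, 64.5, 69.5]
frequencies = [7, 10, 22, 15, 12, 6, 3]

for k in (1, 2, 3):
    print(f"Q{k} = {grouped_quartile(k, lower_boundaries, frequencies, 5):.2f}")
```

Replacing kn/4 by iN/10 or iN/100 in the same loop gives the deciles and percentiles.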
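Deciles
Deciles are measures that divide the frequency distribution into ten equal parts. The values of
the variable corresponding to these divisions are denoted D1, D2, ..., D9, often called the first,
the second, ..., the ninth decile respectively.
To find Di (i = 1, 2, ..., 9) we count iN/10 observations beginning from the lowest class; the
class in which this count falls is the ith decile class. Then
$D_i = L + \frac{C}{f}\left(\frac{iN}{10} - CF\right)$
where L, C and f are the lower class boundary, width and frequency of the ith decile class and
CF = the less than cumulative frequency corresponding to the class immediately preceding the
ith decile class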
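Percentiles
Percentiles are measures that divide the frequency distribution into hundred equal parts. The
values of the variable corresponding to these divisions are denoted P1, P2, ..., P99, often called
the first, the second, ..., the ninety-ninth percentile respectively.
To find Pi (i = 1, 2, ..., 99) we count iN/100 observations beginning from the lowest class; the
class in which this count falls is the ith percentile class. Then
$P_i = L + \frac{C}{f}\left(\frac{iN}{100} - CF\right)$
where L, C and f are the lower class boundary, width and frequency of the ith percentile class and
CF = the less than cumulative frequency corresponding to the class immediately preceding the
ith percentile class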
2) Q2 = D5 = P50 = median, P25 = Q1, P75 = Q3, and Di = P(10i), that is, D1 = P10, D2 = P20, …, D9 = P90.
3) Quantiles have the advantage of being less sensitive to outliers and of not being much
affected by the sample size (n).
The term dispersion is generally used in two senses. Firstly, dispersion refers to the variation
of the items among themselves. If the value of all the items of a series is the same, there is
no variation among the items and the dispersion is zero; the more the items differ from one
another, the greater the dispersion. Secondly, dispersion refers to the variation of the items
around an average. If the difference between the values of the items and the average is large,
the dispersion is high; if the difference is small, the dispersion is low.
Thus, dispersion is defined as the scatteredness or spread of the individual items in a given
series.
The measures of dispersion are helpful in statistical investigation. Some of the main
objectives of dispersion are as under:
Mean
Set 1: 60 40 30 50 60 40 70 50
Set 2: 50 49 49 51 48 50 53 50
Set 3: 50 50 50 50 50 50 50 50
The three data sets have a mean of 50, but obviously set 1 is more “spread out” than set 2 and
set 3 has no variability.
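As a quick check of the three data sets above, the following minimal Python sketch computes the mean and the range (largest minus smallest value, the simplest measure of spread) of each set. The function name `mean_and_range` is chosen here for illustration only.

```python
from statistics import mean

# The three data sets from the text.
data_sets = {
    "Set 1": [60, 40, 30, 50, 60, 40, 70, 50],
    "Set 2": [50, 49, 49, 51, 48, 50, 53, 50],
    "Set 3": [50, 50, 50, 50, 50, 50, 50, 50],
}

def mean_and_range(values):
    """Return the arithmetic mean and the range (max - min) of the values."""
    return mean(values), max(values) - min(values)

for name, values in data_sets.items():
    m, r = mean_and_range(values)
    print(f"{name}: mean = {m}, range = {r}")
```

All three means equal 50, while the ranges (40, 5 and 0) differ, which is exactly the information a measure of central tendency alone cannot convey.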
Objectives
The general objective of measuring dispersion is to obtain a single summary figure which
adequately exhibits whether the distribution is compact or spread out.
1. Absolute measures of dispersion: An absolute measure is expressed in the same statistical unit
in which the original data are given, such as kilograms, tonnes etc. These measures are suitable
for comparing the variability in two distributions having variables expressed in the same units
and of the same average size. These measures are not suitable for comparing the variability
in two distributions having variables expressed in different units.
The measures of dispersion which are expressed in terms of the original unit of a series are
termed absolute measures. Such measures are not suitable for comparing the variability of
two distributions which are expressed in different units of measurement or which differ in average
size. A relative measure of dispersion is the ratio or percentage of a measure of absolute
dispersion to an appropriate measure of central tendency and is thus a pure number,
independent of the units of measurement. For comparing the variability of two distributions
(even if they are not measured in the same unit), we compute the relative measure of
dispersion instead of the absolute measure of dispersion.
It is useful for comparing variation in two or more distributions where units of measurements
are the same. Various measures of dispersions are in use. The most commonly used measures
of dispersions are:
The Range (R): The range is the largest score minus the smallest score: R = L – S,
where R = range, L = largest value in the series, and S = smallest value in the series.
The following two distributions have the same range, 13, yet appear to differ greatly in the
amount of variability.
Distribution 1: 32 35 36 36 37 38 40 42 42 43 43 45
Distribution 2: 32 32 33 33 33 34 34 34 34 34 35 45
For this reason, among others, the range is not the most important measure of variability.
Relative Range (RR): It is also sometimes called the coefficient of range and is given by:
$RR = \frac{L - S}{L + S}$
Example:
2. If the range and relative range of a series are 4 and 0.25 respectively, then what is the value of:
Example 4.1: five students obtained the following marks in statistics: . Find the
Range and coefficient of range
Solution: Here, L = 35 and S = 15.
Range = L – S = 35 – 15 = 20
Coefficient of Range = $\frac{L - S}{L + S} = \frac{35 - 15}{35 + 15} = 0.4$
Example 4.1: Find out range and coefficient of range of the following series
Frequency 4 9 15 30 40
Solution: Here,
Example 4.2: Find out range and coefficient of range of the following series
Solution: Here,
L = 92, S = 21
Range = L – S = 92 – 21 = 71
Coefficient of Range = $\frac{L - S}{L + S} = \frac{92 - 21}{92 + 21} = 0.62832$
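The two computations in this solution can be reproduced with a small Python sketch. The helper name `range_and_coefficient` is hypothetical; it simply applies Range = L − S and Coefficient of Range = (L − S)/(L + S).

```python
def range_and_coefficient(largest, smallest):
    """Return the range L - S and the coefficient of range (L - S) / (L + S)."""
    r = largest - smallest
    return r, r / (largest + smallest)

r, coefficient = range_and_coefficient(92, 21)
print(r, round(coefficient, 5))
```

The same function applied to L = 35 and S = 15 gives a range of 20 and a coefficient of 0.4.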
The range is a quick and rough measure of variability, although when a test is given back to
students they very often wish to know the range of scores. Because the range is greatly affected by
extreme scores, it may give a distorted picture of the scores. The range for a grouped frequency
distribution is the upper class boundary of the last class interval minus the lower class
boundary of the first class interval, i.e., R = UCB_lci – LCB_fci, where lci and fci denote the last and first class intervals.
Merits:
• It is rigidly defined.
Demerits:
Inter-quartile range and quartile deviation are other measures of dispersion. The difference
between the upper quartile and the lower quartile is called the inter-quartile range.
Symbolically,
$IQR = Q_3 - Q_1$
The inter-quartile range covers the dispersion of the middle 50% of the items of the series. Quartile
deviation, also called the semi-inter-quartile range, is half of the difference between the upper and
lower quartiles, that is, half of the inter-quartile range. Its formula is:
$Q.D = \frac{Q_3 - Q_1}{2}$
The relative measure of quartile deviation, also called the coefficient of quartile deviation, is
defined as:
$C.Q.D = \frac{Q_3 - Q_1}{Q_3 + Q_1}$
Example 4.3: Find the inter-quartile range, quartile deviation and coefficient of quartile
deviation from the following data.
Solution: First arrange the data in ascending order. 25, 18, 20, 24, 27, 28, 30
Example 4.4: Find inter-quartile range, quartile deviation and coefficient of quartile deviation
from the following data
Marks            2   3   4   5   6   7   8   9
No. of students  10  11  12  13  5   12  7   5
Solution:
Marks   No. of students   CF
2       10                10
3       11                21
4       12                33
5       13                46
6       5                 51
7       12                63
8       7                 70
9       5                 75 = N
Total   N = 75
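The remaining steps of this solution are not shown above, so the following Python sketch is only one possible way to finish it. It assumes that Qk is the value of the item at position kN/4 in the ordered data, rounded up to the next whole item; for this table the alternative position k(N + 1)/4 picks out the same items. The function name `quartile_from_frequencies` is hypothetical.

```python
import math
from itertools import accumulate

marks = [2, 3, 4, 5, 6, 7, 8, 9]
students = [10, 11, 12, 13, 5, 12, 7, 5]

def quartile_from_frequencies(k, values, frequencies):
    """Value of the item at position ceil(k*N/4), located via cumulative frequency."""
    n = sum(frequencies)
    position = math.ceil(k * n / 4)
    for value, cf in zip(values, accumulate(frequencies)):
        if cf >= position:
            return value
    raise ValueError("k must be 1, 2 or 3")

q1 = quartile_from_frequencies(1, marks, students)
q3 = quartile_from_frequencies(3, marks, students)
iqr = q3 - q1                    # inter-quartile range
qd = iqr / 2                     # quartile deviation
cqd = (q3 - q1) / (q3 + q1)      # coefficient of quartile deviation
print(q1, q3, iqr, qd, cqd)
```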
Any applied statistician who has analyzed a number of sets of real data is likely to have come
across outliers. The intuitive definition of an outlier is ‘an observation which deviates so
much from other observations as to arouse suspicion that it was generated by a different
mechanism.’ That is, outliers are observations that are distinct from the main body of the data
and are incompatible with the rest of the data. These values may be genuine observations from
individuals with very extreme levels of the variable.
Remark: Q.D or C.Q.D includes only the middle 50% of the observation.
IQR Rule
A simple approach to detecting outliers is to print the data and check them visually. This
is suitable if the number of observations is not too large and if the potential outlier is much
lower or much higher than the rest of the data.
When the number of observations is large, we can check for the presence of outliers
by the 1.5 IQR rule. The steps to identify outliers are presented as follows:
1 2 5 5 7 8 10 11 11 12 15 25
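Solution: The first step is arranging the data in ascending order; then we calculate the first
and third quartiles: Q1 = 5 and Q3 = 11, so IQR = Q3 – Q1 = 6. The lower outlier bound is
Q1 – 1.5(IQR) = 5 – 9 = –4 and the upper outlier bound is Q3 + 1.5(IQR) = 11 + 9 = 20.
Therefore, observations less than –4 or greater than 20 are considered outliers. That is,
25 is an outlier.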
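The 1.5 IQR rule for this data set can be sketched in Python as follows. The quartile convention is an assumption: Qk is taken as the item at position kn/4 (rounded up) in the ordered data, which reproduces the bounds –4 and 20 stated in the solution; other quartile conventions give slightly different bounds. The function names are hypothetical.

```python
import math

def quartile(sorted_values, k):
    """Item at position ceil(k*n/4) (1-based) of already sorted data."""
    position = math.ceil(k * len(sorted_values) / 4)
    return sorted_values[position - 1]

def iqr_outliers(values):
    """Return (lower bound, upper bound, outliers) under the 1.5 IQR rule."""
    data = sorted(values)
    q1, q3 = quartile(data, 1), quartile(data, 3)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return lower, upper, outliers

data = [1, 2, 5, 5, 7, 8, 10, 11, 11, 12, 15, 25]
print(iqr_outliers(data))
```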
By using the concept of the 1.5 IQR rule, we can draw a box plot, which displays the five-number
summary. The five-number summary consists of the minimum, first quartile, median, third quartile
and maximum.
1. Notice that you must have ordered data before you can find the five-number summary.
2. Find the median first. It is the middle point.
3. Then find the quartiles Q1 and Q3 and the 1.5 IQR outlier limits.
4. Draw a “box" from Q1 to Q3 with bars at Q1, Q3 and the median. (In the below example the
box is horizontal, but it could also be vertical.)
5. Draw a straight line from Q3 to either the largest observation or the upper
outlier bound, whichever is smaller.
6. Draw a straight line from Q1 to either the smallest observation or the lower
outlier bound, whichever is larger.
7. Any remaining observations (the outliers) are shown as individual points on the plot.
Exercise: Take the data 1, 2, 5, 5, 7, 8, 10, 11, 12, 12, 18, 25 and draw box plot
Merits of quartile deviation
It is simple to understand
It is easy to compute
It is well-defined
It helps in studying the middle 50% of the items in the series
It is not affected by the extreme items
It is useful in the case of open-ended distributions
Demerits of quartile deviation
The Mean Deviation (M.D): The mean deviation of a set of items is defined as the arithmetic
mean of the absolute deviations from a given average. Depending on the type
of average used, we have different mean deviations.
$MD = \frac{1}{n}\sum_{i=1}^{n} |X_i - A|$, where A is the average (mean, median or mode) about which the deviations are taken.
For frequency distribution data, where the values X1, X2, X3, …, Xm occur f1, f2,
f3, …, fm times respectively, the mean deviation is obtained by:
$MD = \frac{1}{n}\sum_{i=1}^{m} f_i |X_i - A|$, where $n = \sum_{i=1}^{m} f_i$ and A is the chosen average.
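For grouped data, that is, if the data are given in the form of a frequency distribution of k classes
in which mi and fi are the class mark and frequency of the ith class respectively, the mean
deviation is computed in the same way, with the class marks mi used in place of the values Xi.
For ungrouped data:
b. Mean deviation from median = $\frac{1}{n}\sum |x_i - Md|$
c. Mean deviation from mode = $\frac{1}{n}\sum |x_i - Mode|$
For frequency distribution data:
b’. Mean deviation from median = $\frac{1}{n}\sum f_i |x_i - Md|$
c’. Mean deviation from mode = $\frac{1}{n}\sum f_i |x_i - Mode|$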
Xi 10 8 9 7 6
Fi 8 9 13 6 3
then
Xi 10 8 9 7 6
fi 8 9 13 6 3
Interpretation: On average, each value deviates by 1.02 from the arithmetic mean, 8.4.
Note: You can also calculate the mean deviation about the median and the mode.
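The figures in the interpretation can be checked with a short Python sketch that applies the frequency form of the mean deviation about the mean, MD = (1/n) Σ fi |Xi − mean|, to the table above. The function name `mean_deviation` is hypothetical.

```python
values = [10, 8, 9, 7, 6]   # Xi
freqs = [8, 9, 13, 6, 3]    # fi

def mean_deviation(values, freqs):
    """Return (mean, mean deviation about the mean) of a frequency distribution."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    md = sum(f * abs(x - mean) for x, f in zip(values, freqs)) / n
    return mean, md

mean, md = mean_deviation(values, freqs)
print(round(mean, 1), round(md, 2))
```

Replacing `mean` inside the absolute value by the median or the mode gives the other two mean deviations.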
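Coefficient of mean deviation: $CMD = \frac{MD}{A}$, where A is the average (mean, median or mode) about which the mean deviation was computed.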
Exercise: Find all coefficients of mean deviations for the following frequency distribution:
Merits of mean deviation
It is simple to understand
It is easy to compute
It is well-defined
Demerits of mean deviation
Note that: of all the mean deviations taken about different averages or any arbitrary value,
the mean deviation about the median has the smallest value.
The Variance: The variance is the "average squared deviation from the mean"; it measures the
average of the squared deviations of the observations from their mean.
Suppose we have a population of N observations, say X1, X2, X3, …, XN; then we define the
population variance as:
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}X_i^2 - \mu^2$.
But most of the time we have a sample of n observations, say X1, X2, X3, …, Xn, from the
population of N; then we define the sample variance as:
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$.
This measure of variation is universally used to show the scatter of the individual
measurements around the mean of all the measurements in a given distribution. Its
disadvantage is that the units of variance are the square of the units of the original
observations. The simplest way around this difficulty is to use the square root of the variance as a
measure of variability, called the standard deviation.
Standard deviation is the most important and widely used measure of dispersion. It was
first used by Karl Pearson in 1893. The standard deviation of a statistical data is defined as the
positive square root of the mean of the squared deviations of items from the mean of the series
under consideration.
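The population and the sample standard deviations, denoted by σ and S respectively, are
$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2}$ and $S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$.
For the case of frequency distribution data, the population and sample variances are given as:
$\sigma^2 = \frac{1}{N}\sum_{i=1}^{m} f_i (X_i - \mu)^2$ and $S^2 = \frac{1}{n-1}\sum_{i=1}^{m} f_i (X_i - \bar{X})^2$
and the square roots of these give the corresponding standard deviations.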
To obtain the variance and standard deviation of data presented in a grouped frequency
distribution, we make the same assumption as in the calculation of the mean for
grouped data: the values falling into a class are evenly distributed within it, so the
observations in each class can be represented by the class mark. The calculation is the same as
for data given in a frequency distribution, except that Xi is replaced by the midpoint
of each class and m by k.
5. Since the data is a sample, divide the number (from step 4 above) by the number of
observations minus one, (i.e., n-1), where n is the number of observations in the data set.
Example: Areas of sprayable surfaces with DDT from a sample of 15 houses are as follows
(m2): 101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145. Find the
variance and standard deviation of the above distribution.
Solution: The sample mean is 125 m2, the sample variance is S2 = 2502/14 ≈ 178.71 and the
standard deviation is S ≈ 13.37 m2.
This implies that, on average, the sprayable surface area of a house deviates from the mean by
about 13.37 m2.
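A minimal Python sketch of this example, using the standard library `statistics` module (whose `variance` and `stdev` functions use the sample divisor n − 1), is given below as a way to verify the hand computation.

```python
from statistics import mean, stdev, variance

# Sprayable surface areas (m2) of the 15 sampled houses.
areas = [101, 105, 110, 114, 115, 124, 125, 125, 130,
         133, 135, 136, 137, 140, 145]

print("mean     =", mean(areas))
print("variance =", round(variance(areas), 2))   # sample variance, divisor n - 1
print("std dev  =", round(stdev(areas), 2))      # sample standard deviation
```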
Examples: Find the variance and standard deviation of the following sample data
Class Frequency
40-44 7
45-49 10
50-54 22
55-59 15
60-64 12
65-69 6
70-74 3
Solutions: a) X̄ = 11
Xi            5    10   12   17   Total
(Xi - X̄)2     36   1    1    36   74
b) X̄ = 55
mi (midpoint)   42   47   52   57   62   67   72   Total
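Since the rest of the table for part b) is not shown, the following Python sketch illustrates how the grouped computation could be completed. It assumes the sample formula S2 = Σ fi (mi − X̄)2 / (n − 1) with the class midpoints standing in for the observations; the function name `grouped_sample_variance` is hypothetical.

```python
import math

midpoints = [42, 47, 52, 57, 62, 67, 72]   # class marks of 40-44, ..., 70-74
freqs = [7, 10, 22, 15, 12, 6, 3]

def grouped_sample_variance(midpoints, freqs):
    """Return (mean, sample variance) of grouped data using class midpoints."""
    n = sum(freqs)
    mean = sum(f * m for m, f in zip(midpoints, freqs)) / n
    squared = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs))
    return mean, squared / (n - 1)

mean, var = grouped_sample_variance(midpoints, freqs)
print(mean, round(var, 2), round(math.sqrt(var), 2))
```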
1. Consider a sample X1, …, Xn, which will be referred to as the original sample. To create a
translated sample, add a constant C to each data point: let Yi = Xi + C, i = 1, …, n.
If we want to compute the standard deviation of the translated sample, we can show
that the following relationship holds: if Yi = Xi + C, i = 1, …, n, then Sy = Sx.
Therefore, the standard deviation of Y is the same as the standard deviation of X.
2. What happens to the standard deviation if the units or scales being worked with are changed?
A re-scaled sample can be created: if Yi = CXi, i = 1, …, n, then Sy = |C|Sx and S2y = C2S2x.
Therefore, to find the variance and standard deviation of the Y’s, compute the variance and
standard deviation of the X’s and multiply them by C2 and |C|, respectively.
Solution: Let Yi denote the °F temperature that corresponds to a °C temperature of Xi. The
required transformation to convert the data to °F is Yi = (9/5)Xi + 32, i = 1, 2, 3, …, n, so by
properties 1 and 2 above, Sy = (9/5)Sx.
3. On the other hand, where several standard deviations for a variable are available and we
need to compute the combined standard deviation, the pooled standard deviation (Sp) of the
entire group consisting of all the samples may be computed as:
$S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2 + \dots + (n_k - 1)S_k^2}{n_1 + n_2 + \dots + n_k - k}}$
4. The value of S is never negative, and it is zero only when all of the data values are the
same. Values close together will yield a small SD, whereas values spread apart will yield a
larger SD. Larger values of S therefore indicate a greater amount of variation.
Example: The standard deviation of systolic blood pressure was found to be 10.6 and 15.2
mm Hg, respectively, for two groups of 12 and 15 men. What is the standard deviation of
systolic pressure of all the 27 men?
Solution: Given: Group 1: S1 = 10.6 and n1 = 12 Group 2: S2 = 15.2 and n2 = 15, then
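The computation for this example can be sketched in Python as below. The sketch assumes the usual pooled formula, in which each group variance is weighted by its degrees of freedom (ni − 1) and the total is divided by the combined degrees of freedom; the function name `pooled_sd` is hypothetical, and the group means are not needed under this assumption.

```python
import math

def pooled_sd(sizes, sds):
    """Pooled SD: sqrt( sum((n_i - 1) * S_i^2) / (sum(n_i) - k) )."""
    numerator = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    degrees_of_freedom = sum(sizes) - len(sizes)
    return math.sqrt(numerator / degrees_of_freedom)

sp = pooled_sd([12, 15], [10.6, 15.2])
print(round(sp, 2))   # pooled SD in mm Hg
```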
The coefficient of variation is also useful for comparing the reproducibility of different
variables. CV is a relative measure free from unit of measurement.
Examples: An analysis of the monthly wages paid (in Birr) to workers in two firms A and B
belonging to the same industry gives the following results.
Since $C.V_A < C.V_B$, there is greater variability in individual wages in firm B.
Just as it is possible to calculate the combined mean of two or more groups, the
combined or pooled standard deviation of two or more groups can also be calculated. The
combined standard deviation of two groups is computed as follows:
Where $n_i$ is the sample size of the ith group and $S_i^2$ is the variance of the ith group.
Example 4.7: Two samples of size 100 and 150, respectively, have means 50 and 60 and
standard deviations 5 and 6. Find the mean and standard deviation of the combined sample
of size 250.
Solution: Given
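Because the formula used in this solution is not legible above, the sketch below relies on one widely used textbook form, which is an assumption here: the combined variance is Σ ni (Si2 + di2) / Σ ni, where di is the difference between the ith group mean and the combined mean, and the group standard deviations are treated as computed with divisor ni. The function name `combined_mean_sd` is hypothetical.

```python
import math

def combined_mean_sd(sizes, means, sds):
    """Combined mean and SD of several groups (assumed divisor-n form)."""
    n = sum(sizes)
    mean = sum(ni * mi for ni, mi in zip(sizes, means)) / n
    variance = sum(ni * (si ** 2 + (mi - mean) ** 2)
                   for ni, mi, si in zip(sizes, means, sds)) / n
    return mean, math.sqrt(variance)

mean, sd = combined_mean_sd([100, 150], [50, 60], [5, 6])
print(mean, round(sd, 2))
```

Under this assumption the combined mean is 56 and the combined variance is 55.6; a formula based on divisor n − 1 would give a slightly different standard deviation.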
Exercise: Find the shortcut formula to find standard deviation, mean deviation and combined
deviation.
In certain cases, mean and standard deviation are calculated by using one or two incorrect
values of the variable. Just as we can correct an incorrect mean, similarly, there is a procedure
of correcting an incorrect standard deviation.
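1. Find the incorrect sum of squared values of the variable, that is, the incorrect $\sum X^2$.
2. Find the corrected $\sum X^2$. To do so, subtract the square of the incorrect item from the incorrect
$\sum X^2$ and add the square of the correct item. Thus,
corrected $\sum X^2$ = incorrect $\sum X^2$ – (incorrect item)$^2$ + (correct item)$^2$.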
Quartile deviation is approximately equal to $\frac{2}{3}\sigma$
Mean deviation is approximately equal to $\frac{4}{5}\sigma$
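The standard deviation of the first n natural numbers can be found from the following
formula: $\sigma = \sqrt{\frac{n^2 - 1}{12}}$
For example, the standard deviation of the first 5 natural numbers is given as:
$\sigma = \sqrt{\frac{5^2 - 1}{12}} = \sqrt{2} \approx 1.41$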
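The closed form for the first n natural numbers, σ = sqrt((n2 − 1)/12), is a population standard deviation (divisor n). The short Python sketch below compares it with a direct computation using `statistics.pstdev`; the function name `sd_first_n` is hypothetical.

```python
import math
from statistics import pstdev

def sd_first_n(n):
    """Closed-form population SD of the numbers 1, 2, ..., n."""
    return math.sqrt((n ** 2 - 1) / 12)

# Direct computation on 1..5 versus the closed form.
print(sd_first_n(5), pstdev(range(1, 6)))
```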
Example 4.8: If the value of standard deviation in moderately symmetrical distribution is 24,
find the value of mean deviation and quartile deviation.
Example 4.9: If the mean and the standard deviation of the weights of 25 boys are 50 and 5,
respectively, at least how many boys will be in the interval ?
Solution:
the interval
When the observations are grouped into classes, all observations in a class are assumed to be equal
to the midpoint of the class. This introduces some error, known as grouping error. Sheppard
suggested a correction, known as Sheppard’s correction, under which the corrected variance is
$S^2 - \frac{C^2}{12}$, where C is the class width.
Merits of standard deviation
It is simple to understand
It is well-defined
It is based on all items
Demerits of standard deviation
It is not easy to calculate
It is unduly affected by extreme values
A standard score (Z-score) for a sample value in a data set is obtained by subtracting the mean of
the data set from the value and dividing the result by the standard deviation of the data set.
Basically, the Z-score is the number of standard deviations that a given value X is below or above
the mean, and it is defined as $Z = \frac{X - \bar{X}}{S}$ (for sample data sets) and $Z = \frac{X - \mu}{\sigma}$ (for population data sets).
Values above the mean have positive Z-scores and values below the mean have negative Z-
scores. The numerical value of the Z-score reflects the relative standing of the value; because of
this, the Z-score is also referred to as a measure of relative standing. Scores are generally
meaningless by themselves unless they are compared to the distribution of scores from some
reference group. In addition to comparing data sets, the Z-score is useful for transforming a given
data set into a new distribution with mean zero and variance one; when the original data are
normally distributed, this is the standard normal distribution (we will see it in the chapters on
hypothesis testing).
Note: A Z-score less than -2 or greater than 2 is considered an unusual value, while a Z-score
between -2 and 2 is considered an ordinary value.
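To illustrate the note, the following Python sketch computes sample Z-scores and flags values with |Z| > 2. The data set is borrowed from the 1.5 IQR example earlier in this chapter; reusing it here is a choice made for this sketch, and the function name `z_scores` is hypothetical.

```python
from statistics import mean, stdev

data = [1, 2, 5, 5, 7, 8, 10, 11, 11, 12, 15, 25]

def z_scores(values):
    """Sample Z-score (X - mean) / S for every value."""
    m = mean(values)
    s = stdev(values)   # sample standard deviation, divisor n - 1
    return [(x - m) / s for x in values]

# Values more than two standard deviations away from the mean.
unusual = [x for x, z in zip(data, z_scores(data)) if abs(z) > 2]
print(unusual)
```

For this data set the value flagged as unusual is the same observation that the 1.5 IQR rule identifies as an outlier.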
Examples: 1. Two sections were given introduction to statistics examinations. The following
information was given.
Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively
speaking who performed better?
Student A performed better relative to his section because the score of student A is two
standard deviations above the mean score of his section while, the score of student B is only
one standard deviation above the mean score of his section.
2. Two groups of people were trained for a 100km race and tested to find out which group is faster to
complete the race. For the two groups the following information was given:
Relatively speaking:
b) Suppose person A from group one takes 9.2 minutes while person B from group two takes 9.3
minutes. Who was faster in completing the race? Why? DO!!
In describing a numerical data set it is not only necessary to summarize the data by presenting
appropriate measures of central tendency, dispersion and relative standing, it is also necessary
to consider the shape of the data – the manner, in which the data are distributed. There are two
measures of the shape of a data set: skewness and kurtosis.
Moments
Moments are statistical measures used to describe the characteristics of a distribution and we
can have moment about any number A and /or about the mean (called central moment).
Skewness
The direction of the skewness depends upon the location of the extreme values. If the extreme
values are the larger observations, the mean will be the measure of location most greatly
distorted toward the upward direction. Since the mean exceeds the median and the mode, such
distribution is said to be positive or right-skewed. The tail of its distribution is extended to the
right.
On the other hand, if the extreme values are the smaller observations, the mean will be the
measure of location most greatly reduced. Since the mean is exceeded by the median and the
mode, such distribution is said to be negative or left-skewed. The tail of its distribution is
extended to the left.
sample.
Properties of Skewness
Kurtosis
Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the
bell-shaped distribution (normal distribution) or kurtosis is the degree of measure of
peakedness of a distribution.
Kurtosis of a sample data set is calculated directly from the data by the formula:
= -
It is also possible to calculate the measure of kurtosis from the rth moments about the mean of
the sample data as: $\beta_2 = \frac{M_4}{M_2^2}$, where $M_4$ is the 4th moment about the mean and $M_2$ is the 2nd moment about the mean.
If we want our reference point to be zero, we can change the above coefficient to: φ = $\beta_2$ - 3.
- If the attributes are independent then the probability of possessing both A and B is PA*PB
                          B
A        B1    B2    ...   Bj    ...   Bc    Total
A1       O11   O12   ...   O1j   ...   O1c   R1
.
.
Ai       Oi1   Oi2   ...   Oij   ...   Oic   Ri
.
.
Ar       Or1   Or2   ...   Orj   ...   Orc   Rr
Total    C1    C2    ...   Cj    ...   Cc    n
The chi-square procedure test is used to test the hypothesis of independency of two attributes.
For instance, we may be interested
$\chi^2_{cal} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - e_{ij})^2}{e_{ij}} \sim \chi^2_{(r-1)(c-1)}$
where $e_{ij} = \frac{R_i \times C_j}{n}$
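Remark:
$n = \sum_{i=1}^{r}\sum_{j=1}^{c} O_{ij} = \sum_{i=1}^{r}\sum_{j=1}^{c} e_{ij}$
Decision Rule:
Reject $H_0$ if $\chi^2_{cal} = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(O_{ij} - e_{ij})^2}{e_{ij}} > \chi^2_{\alpha}\big((r-1)(c-1)\big)$ at the $\alpha$ level of significance.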
Examples:
1. A geneticist took a random sample of 300 men to study whether there is an association between
father and son regarding boldness. He obtained the following results.
                Son
Father      Bold    Not
Bold        85      59
Not         65      91
Using a 5% level of significance, test whether there is an association between father and son regarding boldness.
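Solution:
$e_{ij} = \frac{R_i \times C_j}{n}$
$e_{11} = \frac{R_1 \times C_1}{n} = \frac{144 \times 150}{300} = 72$
$e_{12} = \frac{R_1 \times C_2}{n} = \frac{144 \times 150}{300} = 72$
$e_{21} = \frac{R_2 \times C_1}{n} = \frac{156 \times 150}{300} = 78$
$e_{22} = \frac{R_2 \times C_2}{n} = \frac{156 \times 150}{300} = 78$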
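$\chi^2_{cal} = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij} - e_{ij})^2}{e_{ij}} = \frac{(85-72)^2}{72} + \frac{(59-72)^2}{72} + \frac{(65-78)^2}{78} + \frac{(91-78)^2}{78} \approx 9.03$
$\alpha = 0.05$
Degrees of freedom = (r - 1)(c - 1) = 1*1 = 1
$\chi^2_{0.05}(1) = 3.841$ from table.
Since $\chi^2_{cal} \approx 9.03 > 3.841$, we reject $H_0$ and conclude that there is an association between father and son regarding boldness.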
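The whole test for this 2 x 2 table can be reproduced with a short Python sketch. It computes the expected frequencies eij = Ri*Cj/n and the statistic Σ (Oij − eij)2 / eij using only the standard library; the critical value 3.841 is taken from the table as quoted above, and the function name `chi_square_independence` is hypothetical.

```python
observed = [[85, 59],
            [65, 91]]

def chi_square_independence(observed):
    """Return (expected frequencies, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi2, df

expected, chi2, df = chi_square_independence(observed)
critical_value = 3.841   # chi-square table value for alpha = 0.05, df = 1
print(expected, round(chi2, 2), df, chi2 > critical_value)
```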
2. A random sample of 200 men, all retired, was classified according to education and number of
children as shown below.
Test the hypothesis that the size of the family is independent of the level of education attained
by fathers. (Use 5% level of significance)
Solution:
$H_0$: There is no association between the size of the family and the level of
education attained by fathers.
$H_1$: not $H_0$.
$e_{ij} = \frac{R_i \times C_j}{n}$
$\chi^2_{cal} = \sum_{i=1}^{2}\sum_{j=1}^{3}\frac{(O_{ij} - e_{ij})^2}{e_{ij}}$
$\alpha = 0.05$
Degrees of freedom = (r - 1)(c - 1) = 1*2 = 2
$\chi^2_{0.05}(2) = 5.99$ from table.
EXERCISES:
1. The accompanying data describe the hourly wage rates (dollars per hour) for 30 employees
of an electronics firm:
22.66 24.39 17.31 21.02 21.61 20.97 18.58 16.61
19.74 21.57 20.56 22.16 20.16 18.97 22.64 19.62
22.05 22.03 17.09 24.60 23.82 17.80 16.28 19.34
22.22 19.49 22.27 18.20 19.29 20.43
Construct a frequency distribution, calculate all of the measures of central tendencies and
measures of dispersions.
2. For 75 employees of a large department store, the following distribution for years of service
was obtained. Calculate the following
Class limits Frequency
1–5 21
6–10 25
11–15 15
16–20 0
21–25 8
26–30 6
a. Mean, median, mode of the data
b. Variance, standard deviation, coefficient of variation
c. Q2, D5, P50
No. of employees          12    8    17    6    2    23    13    9    5    14
Income (in “000” ETB)     1.2   4    1.5   3    13   7     9     6    6    10
a. Determine the number of employees involved in the survey
b. Calculate the median income for the employees
c. Calculate the first, second and third quartiles using the data
d. Calculate the standard deviation, mean deviation and coefficient of mean deviation
4. The salaries (in millions of dollars) for 31 NFL teams for a specific season are given in this
frequency distribution.
Class limits Frequency
39.9–42.8 2
42.9–45.8 2
45.9–48.8 5
48.9–51.8 5
51.9–54.8 12
54.9–57.8 5
a. Mean, median, mode of the data
b. Inter quartile range (IQR)
c. Variance, standard deviation, coefficient of variation
d. Q2, D5, P50
e. Kurtosis and skewness
f. What can you conclude about the shape of the data?
5. An insurance company researcher conducted a survey on the number of car thefts in a large
city for a period of 30 days last summer. The raw data are shown. Construct a stem and leaf
plot by using classes 50–54, 55–59, 60–64, 65–69, 70–74, and 75–79.
52 62 51 50 69
58 77 66 53 57
75 56 55 67 73
79 59 68 65 72
57 51 63 69 75
65 53 78 66 55
a. Mean, median, mode of the data
b. Quartile deviation and Inter quartile range (IQR)
c. Variance, standard deviation
d. Q3, D9, P75
e. Kurtosis and skewness
f. What can you conclude about the shape of the data?
6. Find the missing information from the following data.
9. Consider the following grouped frequency distribution and find: (i) the variance (ii)
coefficient of variation (iii) coefficient of skewness (iv) coefficient of kurtosis.
10. The cost of consumer purchases such as single-family housing, gasoline, Internet services, tax
preparation, and hospitalization were provided in The Wall-Street Journal (January 2, 2007).
Sample data typical of the cost of tax-return preparation by services such as H&R
Block are shown below.
120 230 110 115 160
130 150 105 195 155
105 360 120 120 140
100 115 180 235 255
a. Compute the mean, median, and mode.
b. Compute the first and third quartiles.
c. Compute and interpret the 90th percentile.
The present lesson is an attempt to overview the concept of probability, thereby enabling the
students to appreciate the relevance of probability theory in decision-making under conditions
of uncertainty. After successful completion of the lesson the students will be able to
understand and use the different approaches to probability as well as different probability
rules for calculating probabilities in different situations.
The overall objective of this lesson is to discuss the concept of probability, counting
techniques, approaches to probability, random variable and probability distributions. After
successful completion of the lesson the students will be able to appreciate the usefulness of
Probability, counting techniques, probability distributions in decision-making and also
identify situations where Binomial, Poisson, exponential, and Normal probability distributions
can be applied.
Life is full of uncertainties. ‘Probably’, ‘likely’, ‘possibly’, ‘chance’ etc. are some of the most
commonly used terms in our day-to-day conversation. All these terms more or less convey the
same sense - “the situation under consideration is uncertain and commenting on the future
with certainty is impossible”. Decision-making in such areas is facilitated through formal and
precise expressions for the uncertainties involved. For example, product demand is uncertain,
but a study of demand spelled out in a form amenable to analysis may go a long way to help
analyze and facilitate decisions on sales planning and inventory management. Intuitively, we
see that if there is a high chance of a high demand in the coming year, we may decide to stock
more. We may also take some decisions regarding the price increase, reducing sales expenses
etc. to manage the demand. However, in order to make such decisions, we need to quantify
the chances of different quantities of demand in the coming year. Probability theory provides
us with the ways and means to quantify the uncertainties involved in such situations.
Since uncertainty is an integral part of human life, people have always been interested -
consciously or unconsciously - in evaluating probabilities. Having its origin associated with
gamblers, the theory of probability today is an indispensable tool in the analysis of situations
involving uncertainty. It forms the basis for inferential statistics as well as for other fields that
require quantitative assessments of chance occurrences, such as quality control, management
decision analysis, and almost all areas in physics, biology, engineering and economics or
social life.
In general
Probability theory is the foundation upon which the logic of inference is built.
It helps us to cope with uncertainty.
Probability is the chance of an outcome of an experiment. It is the measure of how likely an
outcome is to occur.
4.2. Fundamental concepts: experiment and event, event and their relationships, conditional
and joint probability
Any action, whether it is drawing a card out of a deck of 52 cards, reading the
temperature, measuring a product's dimension to ascertain quality, or launching
a new product in the market, constitutes an experiment in probability theory terminology.
For example, the product we are measuring may turn out to be undersize or right size or
oversize, and we are not certain which way it will be when we measure it. Similarly,
launching a new product involves uncertain outcome of meeting with a success or failure in
the market. A single outcome of an experiment is called a basic outcome or an elementary
event. Any particular card drawn from a deck is a basic outcome.
Rolling a die 1, 2, 3, 4, 5, 6
Example
Example 1: Considering the experiment of rolling a die, let A be the event of odd numbers,
B be the event of even numbers, and C be the event of number 8.
A = {1, 3, 5}
B = {2, 4, 6}
C = { } or ∅ (the empty set, or impossible event)
Remark: If S (sample space) has n members, then there are exactly 2^n subsets or events.
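The remark can be checked for the die experiment with a small Python sketch that lists every subset of the sample space S = {1, 2, 3, 4, 5, 6}. The function name `all_events` is hypothetical.

```python
from itertools import combinations

def all_events(sample_space):
    """Return every subset (event) of a finite sample space as a list of sets."""
    items = list(sample_space)
    return [set(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

events = all_events([1, 2, 3, 4, 5, 6])
print(len(events))             # number of events for a six-member sample space
print({1, 3, 5} in events)     # the event A of odd numbers
print(set() in events)         # the impossible event C
```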
Example 2: For the experiment of drawing a card, we may obtain different events A, B, and
C like:
In the first case, out of the 52 sample points that constitute the sample space, only one sample
point or outcome defines the event, whereas the number of outcomes used in the second and
third case is 13 and 4 respectively.
5. Equally Likely Events: Events which have the same chance of occurring.
6. Elementary Event: an event having only a single element or sample point.
In other words, $A + \bar{A} = S$
So $P(A + \bar{A}) = P(S)$
or $P(A) + P(\bar{A}) = 1$
or $P(A) = 1 - P(\bar{A})$
As a simple example, if the probability of rain tomorrow is 0.3, then the probability of no rain
tomorrow must be 1 - 0.3 = 0.7.
If the probability of drawing a king is 4/52, then the probability of the drawn card's not being
a King is 1 - 4/52 = 48/52.
exclusive events if there is no sample point in common to both events E1 and E2. For
example, if we roll a fair die, then the experiment is rolling the die and the sample space (S) is
S = {1, 2, 3, 4, 5, 6}
If we are interested in the outcomes of event E1, getting even numbers, and E2, getting odd numbers
11. Independent Events: Two events A and B are independent if the occurrence of one does
not affect the probability of the other occurring. For example, if two fair coins are tossed,
the result of one toss is totally independent of the result of the other toss. The probability
that a head will be the outcome of any one toss is always ½, whatever the outcome of the
other toss. Hence, these two events are independent. On the other hand, consider drawing two
cards, one after the other without replacement, from a pack of 52 playing cards. The
probability that the second card is an ace depends on whether the first card was an ace or
not. Hence, these two events are not independent.
As another example, suppose a bag contains balls of two different colours, say yellow and
white, and two balls are drawn successively. The first ball is drawn and replaced after its
colour is noted. Let us assume that it is yellow and denote this event by A. Another ball is
then drawn from the same bag and its colour is noted; denote this event by B. Clearly, the
result of the first draw has no effect on the result of the second draw. Hence, the events A
and B are independent.
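The coin and card examples can be checked numerically. The sketch below assumes the standard product criterion for independence, P(A and B) = P(A)·P(B), which is not stated in the passage above, and it assumes the two cards are drawn without replacement. The helper `prob` and the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import permutations, product

def prob(event, space):
    """Classical probability: favourable outcomes / total equally likely outcomes."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

# Two fair coins: every ordered pair of faces is equally likely.
coins = list(product("HT", repeat=2))
p_a = prob(lambda o: o[0] == "H", coins)                     # first toss is a head
p_b = prob(lambda o: o[1] == "H", coins)                     # second toss is a head
p_ab = prob(lambda o: o[0] == "H" and o[1] == "H", coins)    # both are heads
coins_independent = (p_ab == p_a * p_b)

# Two cards drawn without replacement; only "ace or not" matters here.
deck = ["A"] * 4 + ["x"] * 48
draws = list(permutations(range(52), 2))    # ordered pairs of distinct cards
q_a = prob(lambda o: deck[o[0]] == "A", draws)               # first card is an ace
q_b = prob(lambda o: deck[o[1]] == "A", draws)               # second card is an ace
q_ab = prob(lambda o: deck[o[0]] == "A" and deck[o[1]] == "A", draws)
cards_independent = (q_ab == q_a * q_b)

print(coins_independent, cards_independent)
```

Under these assumptions the coin tosses satisfy the product criterion and the card draws do not, which matches the verbal argument.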
12. Dependent Events: Two events are dependent if the occurrence of the first event changes
the probability of occurrence of the second event.
Solution
a) S={1,2,3,4,5,6}
b) S={(HH),(HT),(TH),(TT)}
c) S={t /t≥0}
Sample space can be
Countable ( finite or infinite)
Uncountable.
Definition 4.3:
A set is a collection of well-defined objects. These objects are called elements. Sets are
usually denoted by capital letters and elements by small letters. Membership in a given set is
denoted by ∈, and ∉ is used to say that an element does not belong to the set.
Description of sets: Sets can be described in any of the following three ways: the
complete listing method (all elements of the set are listed), the partial listing method (the
elements of the set are indicated by listing some of them) and the set
builder method (an open proposition is used to describe the elements that belong to the set).
Types of set
Universal set: a set that contains all the elements under consideration in a particular
discussion.
Finite set: a set which contains a finite number of elements (e.g. {x : x is an integer, 0 < x < 5}).
Infinite set: a set which contains an infinite number of elements (e.g. {x : x is an integer, x > 0}).
Subset: If every element of set A is also an element of set B, then A is called a subset of B,
denoted by A ⊆ B.
Proper subset: For two sets A and B, if A is a subset of B and B is not a subset of A, then A is
said to be a proper subset of B, denoted by A ⊂ B.
Equal sets: Two sets A and B are said to be equal if every element of A is also an element of
B and every element of B is also an element of A.
Equivalent sets: Two sets A and B are said to be equivalent if there is a one-to-one
correspondence between the elements of the two sets.
There are many ways of operating on two or more sets to get another set. Some of them are
discussed below.
Union of sets: The union of two sets A and B is the set which contains the elements that belong
to either of the two sets. It is denoted by A ∪ B (A union B).
Intersection of sets: The intersection of two sets A and B is the set which contains the elements
that belong to both A and B. It is denoted by A ∩ B (A intersection B).
Absolute complement or complement: Let U be the universal set and A a subset of U.
The complement of A, denoted by A', is the set which contains the elements of U that do
not belong to A.
Relative complement (or difference): The difference of set A with respect to set B,
written as A\B (or A – B), is the set which contains the elements of A that do not belong to B.
Symmetric difference: The symmetric difference of two sets A and B, denoted by A Δ B, is the
set which contains the elements that belong to A but not to B together with the elements that
belong to B but not to A. That is, A Δ B = (A\B) ∪ (B\A).
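The set operations defined above map directly onto Python's built-in `set` type. The sets U, A and B below are hypothetical values chosen only for illustration.

```python
U = set(range(1, 11))       # hypothetical universal set {1, ..., 10}
A = {1, 2, 3, 4, 5}
B = {4, 5, 6, 7}

union = A | B                   # elements in A or B or both
intersection = A & B            # elements in both A and B
complement_A = U - A            # elements of U that are not in A
difference = A - B              # relative complement A\B
symmetric_difference = A ^ B    # in exactly one of A and B

# The symmetric difference equals (A\B) union (B\A).
print(symmetric_difference == (A - B) | (B - A))
```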
Let U be the universal set and let A, B and C be sets in that universe; the following
properties hold true.
A ∩ B = ∅: A and B are mutually exclusive (that is, they cannot occur
simultaneously).
In many problems of probability, we are interested in events that are actually combinations of
two or more events formed by unions, intersections, and complements. Since the concept of
set theory is of vital importance in probability theory, we need a brief review.
The union of two sets A and B, A ∪ B, is the set with all elements in A or B or both.
The intersection of A and B, A ∩ B, is the set that contains all elements in both A and B.
The complement of A, A^c, is the set that contains all elements in the universal set that are
not found in A. Some similarities between notions in set theory and those of probability theory
are:
Again, using a Venn diagram, one can easily verify the following relationships:
1. A ∪ B = (A \ B) ∪ (A ∩ B) ∪ (B \ A), noting that the three are mutually exclusive;
A \ B = A ∩ B' and B ∩ A' = B \ A.
If a sample space has a finite number of points, it is called a finite sample space. If it has as
many points as there are natural numbers 1, 2, 3, …, it is called a countably infinite sample
space. If it has as many points as there are in some interval, such as 0 < x < 1, it is called a
non-countably infinite sample space. A sample space which is finite or countably infinite is
often called a discrete sample space, while one which is non-countably infinite is called a
continuous sample space.
Equally likely outcomes are outcomes of an experiment which have an equal chance of appearing
(they are equally probable). It is commonly assumed that the outcomes of a finite sample
space are equally likely. If we have n equally likely outcomes in the sample space, then the
probability of the ith sample point xi is p(xi) = 1/n, where xi can be the first, second, ... or the
nth outcome.
Example: In an experiment of tossing a fair die, the outcomes are equally likely (each outcome
is equally probable). Hence, P(xi = 1) = P(xi = 2) = P(xi = 3) = P(xi = 4) = P(xi = 5) =
P(xi = 6) = 1/6.
If the number of possible outcomes in an experiment is small, it is relatively easy to list and
count all possible events. When there are large numbers of possible outcomes an enumeration
of cases is often difficult, tedious, or both. Therefore, to overcome such problems one can use
various counting techniques or rules.
In order to determine the number of outcomes, one can use several rules of counting.
To list the outcomes of the sequence of events, a useful device called tree diagram is used.
1. ADDITION RULE
Suppose that a procedure, designated by 1, can be performed in n_1 ways. Assume that there are
k procedures and that the i-th procedure may be performed in n_i ways, i = 1, 2, …, k. Then the
number of ways in which we can perform procedure 1 or 2 or … or k is given by
$n_1 + n_2 + \dots + n_k = \sum_{i=1}^{k} n_i$,
provided that no two procedures can be performed together.
Example1: - Suppose that we are planning a trip and are deciding between bus and train
transportation. If there are 3 bus routes and 2 train routes to go from A to B, find the available
routes for the trip. There are 3+2 = 5 possible routes for someone to go from A to B.
Example 2: A student goes to the nearest snack bar to have breakfast. He can take tea, coffee,
or milk with bread, cake, or sandwich. How many possibilities does he have?
Solutions:
Tea with bread, tea with cake, tea with sandwich
Coffee with bread, coffee with cake, coffee with sandwich
Milk with bread, milk with cake, milk with sandwich
There are 3 + 3 + 3 = 9 possibilities.
If a choice consists of k steps of which the first can be made in n1 ways, the second can be
made in n2 ways, …, the kth can be made in nk ways, then the whole choice can be made in
n1 × n2 × … × nk ways.
Example1: The digits 0, 1, 2, 3, and 4 are to be used in 4 digit identification card. How many
different cards are possible if a) Repetitions are permitted.
Solutions
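a) With repetitions permitted, each of the 4 positions can be filled in 5 ways, so there are
5*5*5*5 = 625 possible cards.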
b)
Example2: -An airline has 6 flights from A to B, and 7 flights from B to C per day. If the
flights are to be made on separate days, in how many different ways can the airline offer from
A to C?
Solution: In operation 1 there are 6 flights from A to B, 7 flights are available to make flight
from B to C. Altogether there are 6*7 = 42 possible flights from A to C.
Example3: - suppose that in a medical study patients are classified according to their blood
type as A, B , AB, and O; according to their RH factors as + or - and according to their
blood pressure as high, normal or low ,then in how many different ways can a patient be
classified ?
Solution
The 1st classification can be done in 4 ways, the 2nd in 2 ways, and the 3rd in 3 ways. Thus, a
patient can be classified in 4*2*3 = 24 different ways.
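The blood-type example can be reproduced by listing every classification explicitly. This is a small sketch using `itertools.product`; the variable names are illustrative.

```python
from itertools import product

blood_types = ["A", "B", "AB", "O"]
rh_factors = ["+", "-"]
pressures = ["high", "normal", "low"]

# Every (blood type, Rh factor, blood pressure) triple is one classification.
classifications = list(product(blood_types, rh_factors, pressures))

# Multiplication rule: the product of the number of ways at each step.
count_by_rule = len(blood_types) * len(rh_factors) * len(pressures)

print(len(classifications), count_by_rule)
```

Listing the outcomes is only practical for small problems; the rule gives the same count without enumeration.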
Example4:- Suppose that a bank has two branches, each branch has two departments, and
each department has four employees. Then there are (2)(2)(4) choices of employees, and the
probability that a particular one will be randomly selected is 1/(2)(2)(4) = 1/16. We may view
the choice as done sequentially: First a branch is randomly chosen, then a department within
the branch, and then the employee within the department.
Permutation
Permutation Rules:
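2. The arrangement of n objects in a specified order using r objects at a time is called the
permutation of n objects taken r at a time. It is written as $_nP_r$ and is given by
$_nP_r = \frac{n!}{(n-r)!}$
3. The number of permutations of n objects in which $k_1$ are alike, $k_2$ are alike, etc. is
$\frac{n!}{k_1!\,k_2!\cdots k_n!}$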
Example1:
Solutions: 1. a)
b)
Here n = 4, r = 2.
There are $_4P_2 = \frac{4!}{(4-2)!} = \frac{24}{2} = 12$ permutations.
Here n = 10,
of which 2 are C, 2 are O, 2 are R, 1 E, 1 T, 1 I, 1 N.
$k_1 = 2,\ k_2 = 2,\ k_3 = 2,\ k_4 = k_5 = k_6 = k_7 = 1$
Using the 3rd rule of permutation, there are
$\frac{10!}{2!\,2!\,2!\,1!\,1!\,1!\,1!} = 453600$ permutations.
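Both permutation counts in this solution can be recomputed with the standard library. The sketch below assumes Python 3.8 or later (for `math.perm`); the helper `multiset_permutations` is a hypothetical name for the third permutation rule.

```python
from math import factorial, perm

def multiset_permutations(counts):
    """Arrangements of n objects where counts[i] objects of kind i are alike."""
    total = factorial(sum(counts))
    for k in counts:
        total //= factorial(k)     # divide out the orderings of identical objects
    return total

two_of_four = perm(4, 2)                                     # n = 4, r = 2
ten_letters = multiset_permutations([2, 2, 2, 1, 1, 1, 1])   # n = 10, letter counts as above

print(two_of_four, ten_letters)
```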
Example 2: Jimma University Registrar Office wants to give students identity numbers
using 4 digits. The digits are to be taken from the following numbers only: {0, 1, 2, 3, 4,
5, 6}. How many different ID numbers could be given by the Registrar?
Solution
We have 7 possible numbers for each digit, and the required number of digits for an ID number
is 4. Hence n = 7 and r = 4. The possible number of ID numbers without repeating
a digit is
$_nP_r = \frac{n!}{(n-r)!} = \frac{7!}{(7-4)!} = 7 \times 6 \times 5 \times 4 = 840.$
The possible number of ID numbers with repetition of digits allowed is
$n^r = 7^4 = 7 \times 7 \times 7 \times 7 = 2401$
Exercises:
Six different statistics books, seven different physics books, and 3 different Economics books
are arranged on a shelf. How many different arrangements are possible if;
Combination
Example: Given the letters A, B, C, and D list the permutation and combination for selecting
two letters.
Solutions:
Permutation: AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, DC
Combination: AB, AC, AD, BC, BD, CD
Note that in permutation AB is different from BA. But in combination AB is the same as BA.
Combination Rule
The number of combinations of r objects selected from n objects is denoted by
$_nC_r$ or $\binom{n}{r}$ and is given by the formula:
$\binom{n}{r} = \frac{n!}{(n-r)!\,r!}$
Example1:
Solutions:
n = 9, r = 5
$\binom{n}{r} = \frac{n!}{(n-r)!\,r!} = \frac{9!}{4!\,5!} = 126$ ways
2. Among 15 clocks there are two defectives. In how many ways can an inspector choose three of
the clocks for inspection so that:
a) There is no restriction.
b) None of the defective clocks is included.
c) Only one of the defective clocks is included.
d) Two of the defective clocks are included.
Solutions: n = 15, of which 2 are defective and 13 are non-defective; and r = 3.
a) If there is no restriction, select three clocks from 15 clocks. This can be done in:
n = 15, r = 3
$\binom{n}{r} = \frac{n!}{(n-r)!\,r!} = \frac{15!}{12!\,3!} = 455$ ways
b) None of the defective clocks is included.
This is equivalent to zero defective and three non-defective, which can be done in:
$\binom{2}{0}\binom{13}{3} = 286$ ways.
c) Only one of the defective clocks is included.
This is equivalent to one defective and two non-defective, which can be done in:
$\binom{2}{1}\binom{13}{2} = 156$ ways.
d) Two of the defective clocks are included.
This is equivalent to two defective and one non-defective, which can be done in:
$\binom{2}{2}\binom{13}{1} = 13$ ways.
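The four clock counts can be verified with `math.comb` (Python 3.8 or later). The variable names below are illustrative. Since cases b), c) and d) are mutually exclusive and cover every selection, their counts should add up to the unrestricted count in a).

```python
from math import comb

total = comb(15, 3)                          # a) no restriction
none_defective = comb(2, 0) * comb(13, 3)    # b) 0 defective, 3 non-defective
one_defective = comb(2, 1) * comb(13, 2)     # c) 1 defective, 2 non-defective
two_defective = comb(2, 2) * comb(13, 1)     # d) 2 defective, 1 non-defective

print(total, none_defective, one_defective, two_defective)
print(none_defective + one_defective + two_defective == total)
```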
Example 2: The number of combinations of the letters a, b and c taken two at a time is
$\binom{3}{2} = \frac{3!}{2!\,1!} = 3$.
These are ab, ac and bc. Note that ab is the same combination as ba, but not the same
permutation.
Example 3: Suppose a box contains 3 red, 3 white and 5 black balls of equal size. We
want to draw 3 balls at a time. In how many ways can we draw one ball of each colour?
Solution: $\binom{3}{1}\binom{3}{1}\binom{5}{1} = 3 \times 3 \times 5 = 45$ ways.
Exercises:
a) There is no restriction.
b) The dictionary is selected?
c) 2 novels and 1 book of poems are selected?
There are four different conceptual approaches to the study of probability theory. These are:
Definition: If a random experiment with N equally likely outcomes is conducted and N_A of
these outcomes are favorable to the event A, then the probability that event A occurs is
$P(A) = \frac{N_A}{N} = \frac{n(A)}{n(S)}$.
Examples:
Solutions:
S = {1, 2, 3, 4, 5, 6}
N = n(S) = 6
A = {4}: N_A = n(A) = 1, so P(A) = n(A)/n(S) = 1/6
A = {2, 4, 6}: N_A = n(A) = 3, so P(A) = n(A)/n(S) = 3/6 = 0.5
A = {1, 3, 5}: N_A = n(A) = 3, so P(A) = n(A)/n(S) = 3/6 = 0.5
A = { }: N_A = n(A) = 0, so P(A) = n(A)/n(S) = 0/6 = 0
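The four die probabilities follow one pattern: count the favourable outcomes and divide by n(S). A minimal sketch, with `classical_probability` as an illustrative helper name, is:

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """P(A) = n(A) / n(S), assuming all outcomes in S are equally likely."""
    favourable = [o for o in sample_space if o in event]
    return Fraction(len(favourable), len(sample_space))

S = {1, 2, 3, 4, 5, 6}
p_four = classical_probability({4}, S)
p_even = classical_probability({2, 4, 6}, S)
p_odd = classical_probability({1, 3, 5}, S)
p_impossible = classical_probability(set(), S)

print(p_four, p_even, p_odd, p_impossible)
```

Exact fractions are used so that results such as 1/6 are not rounded.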
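Solutions:
Total number of selections: $N = n(S) = \binom{80}{10}$

Number of ways in which A can occur: $N_A = n(A) = \binom{30}{10}\binom{50}{0}$
$P(A) = \frac{n(A)}{n(S)} = \frac{\binom{30}{10}\binom{50}{0}}{\binom{80}{10}} = 0.00001825$

Number of ways in which A can occur: $N_A = n(A) = \binom{30}{4}\binom{50}{6}$
$P(A) = \frac{n(A)}{n(S)} = \frac{\binom{30}{4}\binom{50}{6}}{\binom{80}{10}} \approx 0.2645$

Number of ways in which A can occur: $N_A = n(A) = \binom{30}{0}\binom{50}{10}$
$P(A) = \frac{n(A)}{n(S)} = \frac{\binom{30}{0}\binom{50}{10}}{\binom{80}{10}} = 0.00624$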
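These three probabilities all have the same form: a selection of 10 items from 80, split between a group of 30 and a group of 50. The sketch below recomputes them; the function name `prob_split` and its default arguments are assumptions made for illustration, since the original question is not reproduced here.

```python
from math import comb

def prob_split(k_first, n_first=30, n_second=50, draws=10):
    """P(exactly k_first of the drawn items come from the first group)."""
    favourable = comb(n_first, k_first) * comb(n_second, draws - k_first)
    return favourable / comb(n_first + n_second, draws)

p_all_first = prob_split(10)    # all 10 from the group of 30
p_four_first = prob_split(4)    # 4 from the group of 30, 6 from the group of 50
p_none_first = prob_split(0)    # all 10 from the group of 50

print(p_all_first, p_four_first, p_none_first)
```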
3. In a given basket there are 3 yellow, 4 black and 3 white balls. What is the probability of
selecting one black ball?
P(A) = (number of cases favorable to A) / (exhaustive number of cases) = 4/10 = 0.4
Exercises:
1. What is the probability that a waitress will refuse to serve alcoholic beverages to only three
minors if she randomly checks the I.D’s of five students from among ten students of which
four are not of legal age?
The classical definition of probability has a disadvantage in that the words “equally likely” are
vague. In fact, since these words seem to be synonymous with “equally probable”, the
definition is circular because we are essentially defining probability in terms of itself.
For this reason, a statistical definition of probability has been advocated by some people.
According to this the estimated probability, or empirical probability, of an event is taken to be
the relative frequency of occurrence of the event when the number of observations is very
large. The probability itself is the limit of the relative frequency as the number of observations
increases indefinitely.
$P(A) = \lim_{N \to \infty} \frac{N_A}{N}$
Example 1: If records show that 60 out of 100,000 bulbs produced are defective, what is the
probability that a newly produced bulb is defective?
Solution: Let A be the event that the newly produced bulb is defective.
$P(A) = \lim_{N \to \infty} \frac{N_A}{N} \approx \frac{60}{100{,}000} = 0.0006$
Example 2: If 1000 tosses of a coin result in 529 heads, the relative frequency of heads is
529/1000 = 0.529. If another 1000 tosses result in 493 heads, the relative frequency in the
total of 2000 tosses is (529 + 493)/2000 = 0.511.
According to the statistical definition, by counting in this manner we should ultimately get
closer and closer to a number that represents the probability of a head in a single toss of the
coin. From the results so far presented, this should be 0.5 to one significant figure.
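The idea that the relative frequency settles near the probability as the number of tosses grows can be illustrated by simulation. The sketch below uses Python's pseudo-random generator to stand in for a fair coin; the seed and the toss counts are arbitrary assumptions, and a simulation only illustrates the definition, it does not prove it.

```python
import random

def relative_frequency_of_heads(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return the relative frequency of heads."""
    rng = random.Random(seed)    # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# Relative frequency of heads for increasingly long runs of tosses.
freqs = {n: relative_frequency_of_heads(n) for n in (100, 10_000, 100_000)}
for n, f in freqs.items():
    print(n, f)
```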
Axiomatic Approach:
Let E be a random experiment and S be a sample space associated with E. With each event A
we associate a real number P(A), called the probability of A, satisfying the following
properties, called the axioms of probability or postulates of probability.
1. P(A) ≥ 0
2. P(S) = 1, S is the sure event.
3. If A and B are mutually exclusive events, the probability that one or the other occurs equals
the sum of the two probabilities, that is, P(A ∪ B) = P(A) + P(B).
5. P(A') = 1 − P(A)
6. 0 ≤ P(A) ≤ 1
7. P(ø) = 0, ø is the impossible event.
[Venn diagrams: A ∪ B and A ∩ B]
In general, p(A ∪ B) = p(A) + p(B) − p(A ∩ B).
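The general addition rule can be checked on the die sample space. The events A and B below are hypothetical choices that overlap, so the correction term p(A ∩ B) matters.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}      # rolling a fair die
A = {2, 4, 6}               # hypothetical event: an even number
B = {4, 5, 6}               # hypothetical event: a number greater than 3

def p(event):
    """Classical probability of an event within S."""
    return Fraction(len(event), len(S))

lhs = p(A | B)                      # p(A union B), counted directly
rhs = p(A) + p(B) - p(A & B)        # the general addition rule

print(lhs, rhs)
```

Without subtracting p(A ∩ B), the outcomes 4 and 6 would be counted twice.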
Conditional Events: If the occurrence of one event has an effect on the next occurrence of
the other event then the two events are conditional or dependent events.
Example: Suppose we have two red and three white balls in a bag, and two balls are drawn one
after the other.
With replacement: since the first drawn ball is replaced before the second draw, it does not
affect the second draw. For this reason A and B are independent. If we let
A = the event that the first draw is red, then p(A) = 2/5
B = the event that the second draw is red, then p(B) = 2/5
Without replacement: this is conditional because the first drawn ball is not replaced before
the second draw, so it does affect the second draw. If we let
A = the event that the first draw is red, then p(A) = 2/5
B = the event that the second draw is red given that the first draw is red, then P(B) = 1/4
The conditional probability of an event A given that B has already occurred, denoted by
p(A|B), is
$p(A \mid B) = \frac{p(A \cap B)}{p(B)}, \quad p(B) \neq 0$
(2) p(B'|A) = 1 − p(B|A)
Examples
1. For a student enrolling as a freshman at a certain university, the probability is 0.25 that
he/she will get a scholarship and 0.75 that he/she will graduate. The probability is 0.2 that
he/she will both get a scholarship and graduate. What is the probability that a student who
gets a scholarship will graduate?
2. If the probability that a research project will be well planned is 0.60 and the probability that it
will be well planned and well executed is 0.54, what is the probability that it will be well
executed given that it is well planned?
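Both examples apply the definition p(A|B) = p(A ∩ B)/p(B) directly. The following is a minimal sketch of that calculation using the figures given in the two questions; the function name `conditional` is illustrative.

```python
from fractions import Fraction

def conditional(p_joint, p_given):
    """p(A|B) = p(A and B) / p(B); undefined when p(B) is zero."""
    if p_given == 0:
        raise ValueError("the conditioning event must have non-zero probability")
    return p_joint / p_given

# Example 1: p(scholarship) = 0.25, p(scholarship and graduate) = 0.2
p_graduate_given_scholarship = conditional(Fraction(20, 100), Fraction(25, 100))

# Example 2: p(well planned) = 0.60, p(well planned and well executed) = 0.54
p_executed_given_planned = conditional(Fraction(54, 100), Fraction(60, 100))

print(p_graduate_given_scholarship, p_executed_given_planned)
```

Note that the unconditional probability of graduating (0.75) is not needed for Example 1.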
Exercise: A lot consists of 20 defective and 80 non-defective items from which two items are
chosen without replacement. Events A & B are defined as A = the first item chosen is
defective, B = the second item chosen is defective
Note: for any two events A and B the following relation holds.
p(B) = p(B|A)·p(A) + p(B|A')·p(A')
Probability of Independent Events
Example; A box contains four black and six white balls. What is the probability of getting
two black balls in drawing one after the other under the following conditions?
Required: p(A ∩ B)
As we have already noted in the introduction, the basic objective behind calculating
probabilities is to help us in making decisions by quantifying the uncertainties involved in the
situations. Quite often, whether it is in our personal life or our work life, decision-making is
an ongoing process. Consider for example, a seller of winter garments, who is interested in
the demand of the product. In deciding on the amount he should stock for this winter, he has
computed the probability of selling different quantities and has noted that the chance of
selling a large quantity is very high. Accordingly, he has taken the decision to stock a large
quantity of the product. Suppose, when finally the winter comes and the season ends, he
discovers that he is left with a large quantity of stock. Assuming that he remains in this
business, he feels that the earlier probability calculation should be revised in the light of this
experience to help him decide on the stock for the next winter. Similar to the
situation of the seller of winter garments, situations exist where we are interested in an event
on an ongoing basis. Every time some new information is available, we revise our odds
mentally. This revision of probability with added information is formalised in probability
theory with the help of the famous Bayes' Theorem.
The theorem, discovered in 1761 by the English clergyman Thomas Bayes, has had a
profound impact on the development of statistics and is responsible for the emergence of a
new philosophy of science. Bayes himself is said to have been unsure of his extraordinary
result, which was presented to the Royal Society by a friend in 1763 - after Bayes' death. We
will first understand The Law of Total Probability, which is helpful for derivation of Bayes'
Theorem.
Partition of sample space: A collection of events {B1, B2, . . . , Bn} of a sample space S is
called a partition of S if B1, B2, . . . , Bn are mutually exclusive and B1∪ B2 ∪ ·· · ∪ Bn= S.
Theorem of total probability: If the events B1, B2, . . . , Bn constitute a partition of the
sample space S such that P(Bi) ≠ 0 for i = 1, 2, . . . , n, then for any event A of S,
P(A) = P(A ∩ B1) + P(A ∩ B2) + · · · + P(A ∩ Bn)
= P(B1)P(A|B1) + P(B2)P(A|B2) + · · · + P(Bn)P(A|Bn).
Example: In a certain assembly plant, three machines, B1, B2, and B3, make 30%, 45%, and
25%, respectively, of the products. It is known from past experience that 2%, 3%, and 2% of
the products made by each machine, respectively, are defective. Now, suppose that a finished
product is randomly selected. What is the probability that it is defective?
Solution: Consider the following events: A: the product is defective, B1: the product is made
by machine B1, B2: the product is made by machine B2, B3: the product is made by machine
B3. Then, P(B1) = 0.3, P(B2) = 0.45, P(B3) = 0.25, P(A|B1) = 0.02, P(A|B2) = 0.03, P(A|B3)
= 0.02. Applying the theorem of total probability,
P(A) = P(B1)P(A|B1) + P(B2)P(A|B2) + P(B3)P(A|B3)
= (0.3)(0.02) + (0.45)(0.03) + (0.25)(0.02) = 0.006 + 0.0135 + 0.005 = 0.0245.
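The machine example can be worked through in a few lines. The sketch below computes the total probability of a defective product and then, as an illustration of the revision idea behind Bayes' Theorem, the probability that a defective product came from each machine. The standard form P(Bi|A) = P(Bi)P(A|Bi)/P(A) is assumed here; the dictionary names are illustrative.

```python
from fractions import Fraction

# Share of production per machine and defect rate per machine, from the example.
priors = {"B1": Fraction(30, 100), "B2": Fraction(45, 100), "B3": Fraction(25, 100)}
defect_rates = {"B1": Fraction(2, 100), "B2": Fraction(3, 100), "B3": Fraction(2, 100)}

# Theorem of total probability: sum of P(Bi) * P(A|Bi) over the partition.
p_defective = sum(priors[m] * defect_rates[m] for m in priors)

# Revised (posterior) probabilities P(Bi|A) once the product is known to be defective.
posteriors = {m: priors[m] * defect_rates[m] / p_defective for m in priors}

print(float(p_defective))
for machine, value in posteriors.items():
    print(machine, value)
```

The posteriors sum to 1 because B1, B2 and B3 form a partition of the sample space.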
SELF-ASSESSMENT QUESTIONS
1. Explain what you understand by the term ‘probability’. How is the concept of?
2. What are the different approaches to the definition of probability? Are these approaches
contradictory to one another? Which of these approaches will you apply for calculating the
probability that:
(c) Mr. Bhupinder S. Hooda will win the assembly election from Kiloi.
5. Explain the concept of conditional probability with the help of a suitable example.
7. State Bayes' Theorem of probability. Using an appropriate example, develop the
(a) In how many ways we can select three players out of 12 players of the Indian Cricket
team, for playing in the World XI team?
(b) In how many ways can a sub-committee of 2 out of 6 members of the executive committee
of the employees’ association be constituted?
9. What is the probability that a non-leap year, selected at random, will contain
10. A card is drawn at random from well shuffled deck of 52 cards, find the probability
that
11. From a well-shuffled deck of 52 cards, two cards are drawn at random.
(a) If the cards are drawn simultaneously, find the probability that these consist of
(b) If the cards are drawn one after the other with replacement, find the probability that these
consist of
solving it are 1/2, 1/3, 1/4 and 1/5 respectively. Find the probability that the problem
will
(a) be solved
13. The odds that A speaks the truth are 3:2 and the odds that B does so are 7:3. In what
percentage of cases are they likely to
14. Among the sales staff engaged by a company 60% are males. In terms of their
professional qualifications, 70% of males and 50% of females have a degree in marketing.
Find the probability that a sales person selected at random will be
15. A factory has three units A, B, and C. Unit A produces 50% of its products, and units B
and C each produce 25% of the products. The percentages of defective items produced by units
A, B, and C are 3%, 2% and 1%, respectively. If an item selected at random from the
total production of the factory is found defective, what is the probability that it was produced
by:
(a) Unit A
(b) Unit B
(c) Unit C
Introduction
In many situations, our interest does not lie in the outcomes of an experiment as such; we may
find it more useful to describe a particular property or attribute of the outcomes of an
experiment in numerical terms. For example, in three births, our interest may be in the
probabilities of the number of boys. Consider the sample space of 8 equally
likely sample points.
Now look at the variable “the number of boys out of three births”. This number varies among
sample points in the sample space and can take the values 0, 1, 2, 3, and it is random, that is,
subject to chance.
Discrete: a random variable is discrete if it takes only a countable number of values. For
example, number of dots on two dice, number of heads in three coin tosses, number of
defective items, number of boys in three births and so on.
Continuous: a random variable is continuous if it can take on any value in an interval of
numbers (i.e. its possible values are uncountably infinite). For example, measured data on
heights, weights, temperature, time and so on.
A random variable has a probability law - a rule that assigns probabilities to different values
of the random variable. This probability law - the probability assignment is called the
probability distribution of the random variable. We usually denote the random variable by X.
In this lesson, we will discuss discrete probability distributions and Continuous probability
distributions.
E.g.1: Suppose a coin is tossed three times. Let X be the number of heads.
Solution: If we toss a coin three times, then the experiment has a total of eight possible
outcomes, and they are as follows: S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}.
Since X is the characteristic which denotes the number of heads out of the three tosses, a value
of the r.v. X is associated with each outcome of this experiment. Therefore, X is a function
defined on the elements of S, and the possible values of X are {0, 1, 2, 3}.
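The mapping from outcomes to values of X can be made concrete by enumerating the sample space. The sketch below assumes a fair coin, so that the eight outcomes are equally likely, and tabulates how often each value of X occurs; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for three tosses: HHH, HHT, ..., TTT (8 outcomes).
outcomes = ["".join(t) for t in product("HT", repeat=3)]

# X maps each outcome to its number of heads.
counts = Counter(outcome.count("H") for outcome in outcomes)

# With equally likely outcomes, P(X = x) = (number of outcomes with x heads) / 8.
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)
```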
Discrete Random variable – let x be a r.v. If the number of possible values of x is finite or
countably infinite, we call x a discrete r.v.
- Let x be a discrete r.v. With each possible value xi we associate a number P(xi) = P(X = xi),
called the probability of xi. The numbers P(xi) must satisfy the following requirements for a
probability distribution:
1. The sum of the probabilities of all the events in the sample space must be equal to 1, i.e.
Σ P(xi) = 1.
2. The probability of each event in the sample space must be between zero and one inclusive,
i.e. 0 ≤ P(xi) ≤ 1.
The random variable X denoting “the number of boys out of three births” is a discrete random
variable, so it will have a discrete probability distribution. It is easy to see that the
random variable X is a function of the sample space. We can see the correspondence of sample
points with the values of the random variable as follows:
The correspondence between sample points and the value of the random variable allows us to
determine the probability distribution of X as follows:
The above probability statements constitute the probability distribution of the random variable
X = number of boys in three births. We may appreciate how this probability law is obtained
simply by associating values of X with sets in the sample space. (For example, the set {GGB,
GBG, BGG} leads to X = 1.) We may write down the probability distribution of X in table
format, or we may plot it graphically by means of a probability histogram or a line chart.
Exercise: Tossing a coin twice. Let x be the number of heads. Construct a probability
distribution.
The probability distribution of a discrete random variable lists the probabilities of occurrence
of different values of the random variable. We may be interested in cumulative probabilities
of the random variable. That is, we may be interested in the probability that the value of the
random variable is at most some value x. This is the sum of all the probabilities of the values i
of X that are less than or equal to x.
The cumulative distribution function (also called cumulative probability function) F(x) of
a discrete random variable X is
F(x) = P(X ≤ x) = Σ P(i), summed over all values i ≤ x
For example, to find the probability of at most two boys out of three births, we have
F(2) = P(X ≤ 2) = P(0) + P(1) + P(2) = 1/8 + 3/8 + 3/8
= 7/8
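Cumulative probabilities are running sums of the individual probabilities. The sketch below assumes the distribution 1/8, 3/8, 3/8, 1/8 for the number of boys in three births (equally likely sample points) and builds the cumulative distribution with `itertools.accumulate`.

```python
from fractions import Fraction
from itertools import accumulate

# Assumed distribution of X = number of boys in three births.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

values = sorted(pmf)
# F(x) = P(X <= x): running total of the probabilities in increasing order of x.
cdf = dict(zip(values, accumulate(pmf[x] for x in values)))

for x in values:
    print(x, cdf[x])
```

The last cumulative value must equal 1, which is a quick check that the probabilities were entered correctly.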
Continuous Random variable – x is continuous if it assumes all values in some interval (c, d),
where c, d ∈ R, and there exists a function f, called the probability density function (pdf) of x,
satisfying the following conditions.
a. f(x) ≥ 0 ∀x
b. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
c. For any a and b with −∞ < a < b < ∞, we have
$P(a < x < b) = \int_a^b f(x)\,dx$
Remark: a. P(x = a) = 0
b. P(a < x < b) = P(a ≤ x ≤ b) = P(a < x ≤ b) = P(a ≤ x < b)
a.  x      0    1    2    3
    P(x)   ¼    ¼    ¼    ¼
b.  x      0    1    2    3
    P(x)   -1   ½    ¼    ¼
c.  x      1    2    3    4
    P(x)   ¼    ¼    ½    ¼
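One way to test tables such as a, b and c against the two requirements for a discrete probability distribution (every probability between 0 and 1, and all probabilities summing to 1) is sketched below. The function name `is_probability_distribution` is an illustrative assumption.

```python
from fractions import Fraction as F

def is_probability_distribution(probs):
    """True only if every value lies in [0, 1] and the values sum to exactly 1."""
    return all(0 <= p <= 1 for p in probs) and sum(probs) == 1

table_a = [F(1, 4), F(1, 4), F(1, 4), F(1, 4)]
table_b = [F(-1), F(1, 2), F(1, 4), F(1, 4)]      # contains a negative value
table_c = [F(1, 4), F(1, 4), F(1, 2), F(1, 4)]    # values sum to more than 1

for name, table in (("a", table_a), ("b", table_b), ("c", table_c)):
    print(name, is_probability_distribution(table))
```

Exact fractions avoid the rounding problems that can arise when floating-point values are compared with 1.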
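Defn: Let x be a discrete r.v. with possible values x1, x2, x3, …, xn, … and probabilities P(x1),
P(x2), …, P(xn), … respectively. Then the expected value of x, or the mean value of x, denoted
by E(x) or μx, is defined as
$\mu_x = E(x) = \sum_{i=1}^{\infty} x_i P(x_i)$
If x takes only finitely many values, then
$\mu_x = E(x) = \sum_{i=1}^{n} x_i P(x_i)$
and if, in addition, all n values are equally probable, then
$E(x) = \frac{1}{n} \sum_{i=1}^{n} x_i$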
E.g.4: In a family with two children, find the mean of the number of children who will be
girls.
E.g.5: One thousand tickets are sold at $1 each for a color television valued at $350. What is
the expected value of the gain if a person purchases one ticket?
Var(x) = σ² = E{x − E(x)}², or
Var(x) = σ² = E(x²) − [E(x)]²
$= \sum_i x_i^2 P(x_i) - \left[\sum_i x_i P(x_i)\right]^2$
E.g.6: Find the mean and variance of the number of spots that appear when a die is rolled.
x      1     2     3     4     5     6
P(x)   1/6   1/6   1/6   1/6   1/6   1/6
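The mean and variance of the die can be computed directly from the table, using E(x) = Σ x P(x) and Var(x) = E(x²) − [E(x)]². The helper names `mean` and `variance` below are illustrative.

```python
from fractions import Fraction

def mean(pmf):
    """E(x): sum of each value times its probability."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(x) = E(x^2) - [E(x)]^2."""
    mu = mean(pmf)
    return sum(x * x * p for x, p in pmf.items()) - mu ** 2

die = {x: Fraction(1, 6) for x in range(1, 7)}    # fair die, as in the table

print(mean(die), variance(die))
```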
Consider a random variable X that measure the “number of heads” in a three-trial coin tossing
experiment. The probability distribution of X will be
X 0 1 2 3
P(X=x) 1/8 3/8 3/8 1/8
If this experiment is repeated 200 times, we may expect that ‘no head’ and ‘three
heads’ will each occur 25 times, and that ‘one head’ and ‘two heads’ will each occur 75 times.
Since these results are what we expect on the basis of theory, the resultant distribution is
called a theoretical or expected distribution.
However, when the experiment is actually performed 200 times, the results, which we may
actually obtain, will normally differ from the theoretically expected results. It is quite possible
that in actual experiment ‘no head’ and ‘three heads’ may occur 20 and 28 times respectively
and ‘one head’ and ‘two heads’ may occur 66 and 86 times respectively. The distribution so
obtained through actual experiment is called the empirical or observed distribution.
In practice, however, assessing the probability of every possible value of a random variable
through actual experiment can be difficult, even impossible, especially when the probabilities
are very small. But we may be able to find out what type of random variable the one at hand is
by examining the causes that make it random. Knowing the type, we can often approximate
the random variable to a standard one for which convenient formulae are available.
The proper identification of experiments with certain known processes in Probability theory
can help us in writing down the probability distribution function. Two such processes are the
Bernoulli Process and the Poisson Process. The standard discrete probability distributions
that are consequent to these processes are the Binomial and the Poisson distribution. We will
now look into the conditions that characterize these processes, and examine the standard
distributions associated with the processes. This will enable us to identify situations for which
these distributions apply.
Let us first study the Bernoulli random variable, named so in honor of the mathematician
Jakob Bernoulli (1654-1705). It is the building block for other random variables and the
resulting distributions we will study in this lesson.
Suppose an operator uses a lathe to produce pins, and the lathe is not perfect in the sense that
it does not always produce a good pin. Rather, it has a probability p of producing a good pin
and (1 - p) of producing a defective one. Let us denote a good pin as “success” and a
defective pin as “failure”.
Just after the operator produces one pin, it is inspected; let X denote the "number of good pins
produced” i. e. “the number of successes”. Now analyzing the trial- “inspecting a pin” and
our random variable X-“number of successes”, we note two important points:
The trial-“inspecting a pin” has only two possible outcomes, which are mutually exclusive.
Such a trial, whose outcome can only be either a success or a failure, is a Bernoulli trial. In
other words, the sample space of a Bernoulli trial is
S = {success, failure}
The random variable, X, that measures number of successes in one Bernoulli trial, is a
Bernoulli random variable. Clearly, X is 1 if the pin is good and 0 if it is defective.
X:     1    0
P(X):  p    1-p
X ~ BER (p)
Where ~ is read as “is distributed as” and BER stands for Bernoulli.
A Bernoulli random variable is too simple to be of immediate practical use. But it forms the
building block of the Binomial random variable, which is quite useful in practice. The
binomial random variable in turn is the basis for many other useful cases, such as Poisson
random variable.
2. Binomial Distribution
In the real world we often make several trials, not just one, to achieve one or more successes.
Suppose an operator produces n pins, one by one, on a lathe that has probability p of making a
good pin at each trial. The sequence of numbers (1 or 0) denoting the good and defective pins
produced in each of the n trials is a Bernoulli process. For example, in the sequence of nine
trials denoted by
001011001
The third, fifth, sixth and ninth are good pins, or successes. The rest are failures. In practice,
we are usually interested in the total number of good pins rather than the sequence of 1's and
0's. In the example above, four out of nine are good. In the general case, let X denote the total
number of good pins produced in n trials. We then have X = X1 + X2 +………+ Xn where all
Xi ~ BER(p) and are independent.
The random variable that counts the number of successes in many independent, identical
Bernoulli trials is called a Binomial Random Variable.
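The idea that a binomial count is simply the sum of the 0/1 outcomes of a Bernoulli process can be sketched in a few lines of Python. This is only an illustration: the function names, the value p = 0.8 and the random seed are assumptions made for the example, not part of the module.

```python
import random

def count_successes(trials):
    """Total number of successes X = X1 + X2 + ... + Xn in a 0/1 sequence."""
    return sum(trials)

def bernoulli_process(n, p, rng):
    """Simulate n independent Bernoulli(p) trials; 1 = good pin, 0 = defective."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

# The nine-trial sequence from the text: 001011001
sequence = [int(ch) for ch in "001011001"]
print(count_successes(sequence))  # 4 good pins out of 9

# A simulated run; p = 0.8 and the seed are illustrative assumptions
rng = random.Random(1)
run = bernoulli_process(9, 0.8, rng)
print(run, count_successes(run))
```

Repeating the simulated run many times and tabulating the counts would give an empirical picture of the binomial distribution.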
Each Bernoulli trial may result in an outcome that either possesses or does not possess a
specified character. Our primary interest will be in one of these two possibilities.
Conventionally, the outcome of primary interest is termed a success, and the alternative
outcome is termed a failure. These terms are used irrespective of the nature of the outcome.
For example, non-germination of a seed may be termed a success.
The variable X which represents the count of the number of successes in Bernoulli trials will
be a discrete random variable. The probability distribution of such discrete random variable X
is called the binomial distribution.
The expected value of the binomial distribution is np and the variance of it is npq.
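Remark
Writing $\binom{n}{x}$ for the number of combinations of n items taken x at a time, and q = 1 - p, the mean is obtained as follows:
\[
\begin{aligned}
E(X) &= \sum_{x=0}^{n} x\,P(X = x) \\
     &= \sum_{x=0}^{n} x \binom{n}{x} p^{x} q^{n-x} \\
     &= \sum_{x=0}^{n} x\,\frac{n!}{x!\,(n-x)!}\,p^{x} q^{n-x} \\
     &= \sum_{x=1}^{n} x\,\frac{n\,(n-1)!}{x\,(x-1)!\,(n-x)!}\,p\,p^{x-1} q^{n-x} \\
     &= np \sum_{x=1}^{n} \frac{(n-1)!}{(x-1)!\,(n-x)!}\,p^{x-1} q^{n-x} \\
     &= np \sum_{x=1}^{n} \binom{n-1}{x-1} p^{x-1} q^{n-x} \\
     &= np\,(q + p)^{n-1} \\
     &= np\,(1)^{n-1} \qquad [\,q + p = 1\,] \\
     &= np
\end{aligned}
\]
For the variance we use
\[
V(X) = E(X^{2}) - [E(X)]^{2}
\]
Now,
\[
\begin{aligned}
E(X^{2}) &= \sum_{x=0}^{n} x^{2} \binom{n}{x} p^{x} q^{n-x} \\
         &= \sum_{x=0}^{n} [\,x(x-1) + x\,] \binom{n}{x} p^{x} q^{n-x} \\
         &= \sum_{x=0}^{n} x(x-1)\,\frac{n!}{x!\,(n-x)!}\,p^{x} q^{n-x}
            + \sum_{x=0}^{n} x\,\frac{n!}{x!\,(n-x)!}\,p^{x} q^{n-x} \\
         &= \sum_{x=2}^{n} x(x-1)\,\frac{n(n-1)(n-2)!}{x(x-1)(x-2)!\,(n-x)!}\,p^{2} p^{x-2} q^{n-x} + E(X) \\
         &= n(n-1)p^{2} \sum_{x=2}^{n} \frac{(n-2)!}{(x-2)!\,(n-x)!}\,p^{x-2} q^{n-x} + np \\
         &= n(n-1)p^{2} \sum_{x=2}^{n} \binom{n-2}{x-2} p^{x-2} q^{n-x} + np \\
         &= n(n-1)p^{2}\,(q + p)^{n-2} + np \\
         &= n(n-1)p^{2}\,(1)^{n-2} + np \qquad [\,q + p = 1\,]
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
V(X) &= n(n-1)p^{2} + np - (np)^{2} \\
     &= np\,(np - p + 1 - np) \\
     &= np\,(1 - p) \\
     &= npq
\end{aligned}
\]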
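The results E(X) = np and V(X) = npq can also be checked numerically by summing directly over the binomial probabilities. The sketch below does this for one illustrative choice, n = 10 and p = 0.25, which is an assumption made for the example; the helper names are likewise hypothetical.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with parameters n and p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_moments(n, p):
    """Mean and variance computed from the definition, by direct summation."""
    mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
    second_moment = sum(x * x * binomial_pmf(x, n, p) for x in range(n + 1))
    return mean, second_moment - mean**2

n, p = 10, 0.25          # illustrative values, not taken from the module
mean, variance = binomial_moments(n, p)
print(mean, n * p)                 # both close to 2.5
print(variance, n * p * (1 - p))   # both close to 1.875
```

The summed values agree with np and npq up to floating-point rounding.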
The binomial distribution approaches the normal distribution as the number of trials n becomes
large (n→∞) for any fixed value of p. A rule of thumb is that for p < 0.5, the normal
approximation is adequate if np > 15. Departures from these conditions result in less
accurate approximations.
When n is very large and p is very small (n→∞ and p→0), the binomial distribution approaches
the Poisson distribution.
Example 1: A given mid-exam contains 10 multiple-choice questions, and each question has
four alternatives with exactly one correct answer. Find the probability that the student exactly answered
Solution
Using the binomial distribution we can get the probability value easily. Here n = 10, p = ¼ and
q = 1 - p = 1 - ¼ = ¾.
Example 2: Find the probability of getting five heads and seven tails in 12 flips of a balanced coin.
Solution: Given n = 12 trials. Let X be the number of heads. Then p = probability of getting a head = 1/2, and
q = probability of not getting a head = 1/2. Therefore, the probability of getting x heads in
12 flips of the coin is:
\[
P(X = x) = \binom{12}{x}\left(\frac{1}{2}\right)^{x}\left(\frac{1}{2}\right)^{12-x}
         = \binom{12}{x}\left(\frac{1}{2}\right)^{12}
         = \frac{\binom{12}{x}}{4096}.
\]
And for x = 5,
\[
P(X = 5) = \frac{\binom{12}{5}}{4096} = \frac{792}{4096} = 0.1934.
\]
Example 3: If the probability is 0.20 that a person traveling on a certain airplane flight will request a
vegetarian lunch, what is the probability that three of 10 people traveling on this flight will
request a vegetarian lunch?
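A short Python sketch of the binomial probability rule can be used to work Example 3 (n = 10, p = 0.20, x = 3) and to confirm the value found in Example 2. The function name `binomial_pmf` is an illustrative choice rather than anything defined in the module.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Example 3: three of 10 passengers request a vegetarian lunch, p = 0.20
p_three = binomial_pmf(3, 10, 0.20)
print(round(p_three, 4))  # 0.2013

# Example 2: five heads in 12 flips of a balanced coin
p_five = binomial_pmf(5, 12, 0.5)
print(round(p_five, 4))   # 0.1934
```

So, under the binomial model, the probability that exactly three of the 10 people request a vegetarian lunch is about 0.20.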
Checklist 2
Put a tick mark (√) for each of the following questions if you can solve the problems, and an
X otherwise.
Exercise 1
1. The probability that a patient recovers from a rare blood disease is 0.4. If 100 people are
known to have contracted this disease, what is the probability that less than 30 survive?
2. A multiple-choice quiz has 200 questions each with 4 possible answers of which only 1 is the
correct answer. What is the probability that sheer guess-work yields from 25 to 30 correct
answers for 80 of the 200 problems about which the student has no knowledge?
3. A component has a 20% chance of being a dud. If five are selected from a large batch, what is
the probability that more than one is a dud?
4. A company owns 400 laptops. Each laptop has an 8% probability of not working. You
randomly select 20 laptops for your salespeople. (a) What is the likelihood that 5 will be
broken?(b) What is the likelihood that they will all work?
5. A study indicates that 4% of American teenagers have tattoos. You randomly sample 30
teenagers. What is the likelihood that exactly 3 will have a tattoo?
6. An XYZ cell phone is made from 55 components. Each component has a .002 probability of
being defective. What is the probability that an XYZ cell phone will not work perfectly?
7. The ABC Company manufactures toy robots. About 1 toy robot per 100 does not work.
You purchase 35 ABC toy robots. What is the probability that exactly 4 do not work?
8. The LMB Company manufactures tires. They claim that only .007 of LMB tires are
defective. What is the probability of finding 2 defective tires in a random sample of 50 LMB
tires?
9. An HDTV is made from 100 components. Each component has a .005 probability of being
defective. What is the probability that an HDTV will not work perfectly?
3. Poisson distribution
In the binomial distribution the event is dichotomous: within a single trial it either occurs
or does not occur, so multiple occurrences within a single trial are not possible. To overcome
this difficulty we make n larger and larger. When n is large, each trial covers a shorter length
of time, and as a result the probability that the event occurs in a single trial becomes smaller.
This is equivalent to saying that the event is rare. The binomial distribution can still be used
to represent the distribution of such random events; however, the computations become tedious
since n is very large. This can be explained by an example.
Suppose that the number of insects caught in a trap is being studied and that the data are
collected on the number of insects caught per hour. Assume that the probability that an insect
will be caught in any single minute is 0.06. Assume further that the events of insects being
trapped are mutually independent and the probability p = 0.06 remains same for all the
minutes. We may use the binomial distribution to calculate the probability distribution of the
number of insects caught per hour by considering each minute as a separate Bernoulli trial. If
X is the number of insects caught in an hour, then we have
\[
P[X = x] = \binom{60}{x}\,(0.06)^{x}\,(0.94)^{60-x}
\]
Instead of dividing the hour into minutes, seconds may be used as the basic units. Then the
value of p would be reduced to p = 0.06/60 = 0.001. Considering each second as a Bernoulli
trial, we would have a sample size of 60 × 60 = 3600 for a period of one hour. The binomial
distribution would now be
\[
P[X = x] = \binom{3600}{x}\,(0.001)^{x}\,(0.999)^{3600-x}
\]
Thus when n becomes larger and larger the computations using the binomial become tedious.
Fortunately, it has been shown by Poisson that the value of $\binom{n}{x} p^{x} q^{n-x}$ approaches the value of
\[
\frac{(np)^{x}\,e^{-np}}{x!}
\]
when n becomes large and p becomes small in such a way that the equality np = λ is maintained.
This gives the Poisson distribution
\[
P[X = x] = \frac{e^{-\lambda}\,\lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots
\]
In the formula, λ = np = mean number of times an event occurs.
The value of e^(-λ) can be obtained directly from mathematical tables. In the case of the Poisson
distribution the counts of the alternative events, i.e., failures, are not of interest. This is a
contrast between the binomial and Poisson distributions. For the Poisson distribution all that we
need is np, the mean number of successes; we need not know n and p individually. Thus, the
Poisson distribution is determined by the single parameter λ. The special property of the Poisson
distribution is that its mean and variance are both equal to λ, i.e. mean = variance = λ.
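The insect-trap illustration can be checked numerically. The sketch below compares the binomial probabilities with minutes as trials (n = 60, p = 0.06) and with seconds as trials (n = 3600, p = 0.001) against the Poisson probabilities with λ = np = 3.6. The helper names are assumptions made for this example.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """Binomial probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def poisson_pmf(x, lam):
    """Poisson probability of exactly x occurrences with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

lam = 3.6  # np = 60 * 0.06 = 3600 * 0.001

print("x  minutes  seconds  Poisson")
for x in range(7):
    by_minute = binomial_pmf(x, 60, 0.06)
    by_second = binomial_pmf(x, 3600, 0.001)
    print(x, round(by_minute, 4), round(by_second, 4), round(poisson_pmf(x, lam), 4))
```

As n grows with np held fixed, the binomial column moves closer to the Poisson column, which is the limiting behaviour described above.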
Example1: At a parking place the average number of car-arrivals during a specified period of
15 minutes is 2. If the arrival process is well described by a Poisson process, find the
probability that during a given period of 15 minutes
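Solution: Let X denote the number of car arrivals during the specified period of 15 minutes.
So X ~ Poisson(λ = 2), that is, $P(X = x) = \dfrac{e^{-2}\,2^{x}}{x!}$.
b. P(at least two cars will arrive) = P(X ≥ 2)
= 1 - P(X < 2)
= 1 - {P(X = 0) + P(X = 1)}
= 1 - { $\dfrac{e^{-2}\,2^{0}}{0!} + \dfrac{e^{-2}\,2^{1}}{1!}$ }
= 1 - {0.1353 + 0.2707}
= 1 - 0.4060
= 0.5940
c. P(at most three cars will arrive) = P(X ≤ 3)
= P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)
= 0.1353 + 0.2707 + 0.2707 + 0.1804
= 0.8571
d. P(between 1 and 3 cars will arrive) = P(1 ≤ X ≤ 3)
= P(X ≤ 3) - P(X = 0)
= 0.8571 - 0.1353
= 0.7218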
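The car-arrival probabilities can be reproduced with a small Python sketch of the Poisson rule with λ = 2. The function names `poisson_pmf` and `poisson_cdf` are illustrative choices for this example.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = e^(-lam) * lam^x / x!"""
    return exp(-lam) * lam**x / factorial(x)

def poisson_cdf(k, lam):
    """P(X <= k), obtained by adding the individual probabilities."""
    return sum(poisson_pmf(x, lam) for x in range(k + 1))

lam = 2  # average number of arrivals per 15 minutes

p_at_least_two = 1 - poisson_cdf(1, lam)
p_at_most_three = poisson_cdf(3, lam)
p_one_to_three = poisson_cdf(3, lam) - poisson_pmf(0, lam)

print(round(p_at_least_two, 4))   # 0.594
print(round(p_at_most_three, 4))  # 0.8571
print(round(p_one_to_three, 4))   # 0.7218
```

The complement rule in part (b) avoids summing an infinite number of terms, which is why the sketch computes 1 - P(X ≤ 1) rather than adding P(X = 2), P(X = 3), and so on.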
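Example 2: In some experiments it was observed that the incidence of stem fly in black gram
was 6 percent. Suppose we examine 50 black gram plants in a field at random. What is the
probability that at most 3 plants will be found to be affected by stem fly?
Solution
Here n = 50 and p = 0.06, so λ = np = 50(0.06) = 3, and $P[X = x] = \dfrac{e^{-\lambda}\lambda^{x}}{x!}$.
\[
\begin{aligned}
P[X = 0] &= \frac{e^{-3}\,3^{0}}{0!} = e^{-3} \\
P[X = 1] &= \frac{e^{-3}\,3^{1}}{1!} = 3e^{-3} \\
P[X = 2] &= \frac{e^{-3}\,3^{2}}{2!} = 4.5e^{-3} \\
P[X = 3] &= \frac{e^{-3}\,3^{3}}{3!} = \frac{27e^{-3}}{6} = 4.5e^{-3} \\
P[X \le 3] &= e^{-3} + 3e^{-3} + 4.5e^{-3} + 4.5e^{-3} = 13e^{-3} \approx 0.6472
\end{aligned}
\]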
Checklist
Put a tick mark (√) for each of the following questions if you can solve the problems, and an
X otherwise.
4. Hypergeometric Distribution
We are interested in computing probabilities for the number of observations that fall into a
particular category. In the case of the binomial distribution, however, independence among
trials is required. As a result, if that distribution is applied to, say, sampling from a lot of
items (a deck of cards, a batch of production items), the sampling must be done with replacement
of each item after it is observed. On the other hand, the hypergeometric distribution does not
require independence and is based on sampling done without replacement.
Applications for the hypergeometric distribution are found in many areas, with heavy use in
acceptance sampling, electronic testing, and quality assurance. Obviously, in many of these
fields, testing is done at the expense of the item being tested. That is, the item is destroyed and
hence cannot be replaced in the sample. Thus, sampling without replacement is necessary.
In general, we are interested in the probability of selecting x successes from the M items
labeled successes and n − x failures from the N –M items labeled failures when a random
sample of size n is selected from N items. This is known as a hypergeometric experiment,
that is, one that possesses the following two properties: A random sample of size n is selected
without replacement from N items; and of the N items, M may be classified as successes and
N − M are classified as failures. The number X of successes of a hypergeometric experiment is
called a hypergeometric random variable.
Definition: The probability distribution of the hypergeometric random variable X, the number
of successes in a random sample of size n selected from N items of which M are labeled
success and N − M are labeled failure, is
\[
P(X = x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}},
\]
where 0 ≤ x ≤ M and 0 ≤ n − x ≤ N − M.
The range of x can be determined by the three binomial coefficients in the definition, where x
and n−x are no more than M and N –M, respectively, and both of them cannot be less than 0.
Usually, when both M (the number of successes) and N − M (the number of failures) are larger
than the sample size n, the range of a hypergeometric random variable will be x = 0, 1, . . ., n.
Remark
When the number of items in the lot, N, is large relative to the sample size n, the
hypergeometric probability mass function is well approximated by the probability mass function
of a binomial random variable with p = M/N.
Example: Lots of 40 components each are deemed unacceptable if they contain 3 or more
defectives. The procedure for sampling a lot is to select 5 components at random and to reject
the lot if a defective is found. What is the probability that exactly 1 defective is found in the
sample if there are 3 defectives in the entire lot?
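Solution: Using the hypergeometric distribution with n = 5, N = 40, M = 3, and x = 1, we find the
probability of obtaining exactly 1 defective to be
\[
P(X = 1) = \frac{\binom{3}{1}\binom{37}{4}}{\binom{40}{5}} = \frac{198135}{658008} \approx 0.3011.
\]
This plan is not desirable since it detects a bad lot (3 defectives) only about 30% of the time.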
Example 1: Two balls are selected at random, in succession and without replacement, from a bag
containing 5 blue and 3 green balls. Find the pmf of the number of blue balls selected.
Solution: Let X be the number of blue balls selected (successes). We are given 5 blue balls,
3 green balls, and n = 2. Then the probability of selecting x blue balls is:
\[
P(X = x) = f(x) = \frac{\binom{5}{x}\binom{3}{2-x}}{\binom{8}{2}}, \qquad x = 0, 1, 2,
\]
so that $f(0) = \dfrac{3}{28}$, $f(1) = \dfrac{15}{28}$, and $f(2) = \dfrac{10}{28}$.
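Both hypergeometric examples above can be reproduced with exact fractions in Python. This is a minimal sketch; the function name `hypergeom_pmf` and its argument order (x, N, n, M) are choices made for this illustration.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, N, n, M):
    """P(X = x): x successes in a sample of n drawn without replacement
    from N items, of which M are successes."""
    if x < 0 or x > n:
        return Fraction(0)
    # comb(a, b) is 0 when b > a, so impossible counts get probability 0
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))

# Bag with 5 blue and 3 green balls, two drawn without replacement
blue_pmf = [hypergeom_pmf(x, N=8, n=2, M=5) for x in range(3)]
print(blue_pmf)  # 3/28, 15/28 and 5/14 (10/28 in lowest terms)

# Lot of 40 components with 3 defectives, sample of 5, exactly 1 defective
p_one_defective = hypergeom_pmf(1, N=40, n=5, M=3)
print(round(float(p_one_defective), 4))  # 0.3011
```

Using `Fraction` keeps the answers in the same form as the hand calculation, and the three probabilities for the bag example add up to exactly 1.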
ACTIVITY:
An urn contains 8 blue balls and 12 white balls. If five are drawn at random without
replacement, what is the probability that the sample will contain two blue and three white?
Among 16 applicants for a job, 10 have college degrees. If three of the applicants are
randomly chosen for interviews, what are the probabilities that: (a) none has college degrees;
(b) two have college degrees; (c) one has a college degree; (d) all three have college degrees?
Summary
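When n is large and p is small, the binomial distribution can be approximated
by the Poisson distribution as:
\[
P(X = x) \approx \frac{(np)^{x}\,e^{-np}}{x!}, \qquad \text{for } x = 0, 1, 2, \ldots
\]
The Poisson distribution is used to model rare events and the pmf is
given by:
\[
P(X = x) = \frac{\lambda^{x}\,e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots,
\]
where λ is the average number of successes.
Both the mean and the variance of a Poisson distribution are equal to λ.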
In the first case, Binomial random variable X1 could take only finite number of integer
values;0,1,2…n; whereas in the second case, Poisson random variable X2 could take an
infinite number of integer value; 0,1,2,3………… The random variables X1 and X2 are
discrete, in the sense that they could be listed in a sequence, finite or infinite. In contrast to
these, let us consider a situation, where the variable of interest may take any value within a
given range.
Suppose we are planning for measuring the variability of an automatic bottling process that
fills ½-liter (500 cm3) bottles with cola. The variable, say X, indicating the deviation of the
actual volume from the normal (average) volume can take any real value - positive or
negative; integer or decimal. This type of random variable, which can take an infinite number
of values in a given range, is called a continuous random variable, and the probability
distribution of such a variable is called a continuous probability distribution. The concepts
and assumption inherent in the treatment of such distributions are quite different from those
used in the context of a discrete distribution. In the present lesson, after understanding the
basic concepts of continuous distributions, we will discuss the Uniform, Normal and Exponential
distributions, three important continuous distributions that are applicable to many real-life
processes. A continuous random variable is a random variable that can take on any value in an
interval of numbers.
1. Uniform Distribution
One of the simplest continuous distributions in all of statistics is the continuous uniform
distribution. This distribution is characterized by a density function that is “flat,” and thus
the probability is uniform in a closed interval, say [a, b]. Suppose you were to randomly select
a number X represented by a point in the interval [a, b]. The density function of X is then a
horizontal line of constant height over [a, b].
Note that the density function forms a rectangle with base b − a and constant height 1/(b − a) to
ensure that the area under the rectangle equals one. As a result, the uniform distribution is
often called the rectangular distribution.
Definition 9.1:
A random variable whose density has the rectangular form described above is called a uniform
random variable. Therefore, the probability density function for a uniform random variable X
with parameters a and b is given by:
\[
f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\[4pt] 0, & \text{otherwise} \end{cases}
\]
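For a uniform density of constant height 1/(b − a) on [a, b], probabilities are areas of rectangles, and the mean and variance take the standard forms (a + b)/2 and (b − a)²/12. The sketch below illustrates this; the interval [2, 10] and the function names are hypothetical choices for the example, not values from the module.

```python
def uniform_pdf(x, a, b):
    """Density of the uniform distribution on [a, b]."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """P(X <= x): area of the rectangle to the left of x."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def uniform_mean(a, b):
    return (a + b) / 2.0

def uniform_variance(a, b):
    return (b - a) ** 2 / 12.0

a, b = 2.0, 10.0  # hypothetical interval, chosen only for illustration
print(uniform_pdf(5.0, a, b))     # height of the rectangle, 1/8
print(uniform_cdf(4.0, a, b))     # P(X <= 4) = (4 - 2)/8
print(uniform_mean(a, b), uniform_variance(a, b))
```

The same functions can be applied to any interval, for example to check an answer to the activity in this section.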
Example 9.1: The department of transportation has determined that the winning (low) bid X (in
dollars) on a road construction contract has a uniform distribution with probability density
function f(x) = , if < x< 2d, where d is the department of transportation estimate of the
cost of job. (a) Find the mean and SD of X. (b) What fraction of the winning bids on road
construction contracts are greater than the department of transportation estimate?
Activity
Suppose the research department of a steel manufacturer believes that one of the company’s
rolling machines is producing sheets of steel of varying thickness. The thickness X is a
random variable with values between 150 and 200 millimeters. Any sheets less than 160
millimeters thick must be scrapped, since they are unacceptable to buyers. (a) Calculate the
mean and variance of X (b) Find the fraction of steel sheets produced by this machine that
have to be scrapped.
2. Normal Distribution
The Normal Distribution is the most versatile of all the continuous probability distributions. It
is widely used in data-based research in the fields of agriculture, trade, business and
industry. It is found to be useful in characterizing uncertainties in many real-life processes, in
statistical inferences, and in approximating other probability distributions. A large number of
random variables occurring in practice can be approximated by the normal distribution.
A random variable that is affected by many independent causes, where the effect of each
cause is not overwhelmingly large compared to the other effects, closely follows a normal
distribution.
The lengths of pins made by an automatic machine; the times taken by an assembly worker to
complete the assigned task repeatedly; the weights of baseballs; the tensile strengths of a
batch of bolts; and the volumes of cola in a particular brand of canned cola - are good
examples of normally distributed random variables. All of these are affected by several
independent causes where the effect of each cause is small. This knowledge helps us in
calculating the probabilities of different events in varied situations, which in turn is useful for
decision-making.
In many real life situations, we face the problem of making statistical inferences about
processes based on limited data. Limited data is basically a sample from the full body of data
on the process. Irrespective of how the full body of data is distributed, it has been found that
the Normal Distribution can be used to characterize the sampling distribution of many of the
sample statistics. This helps considerably in Statistical Inferences.
Finally, the Normal Distribution can be used to approximate certain probability distributions.
This helps considerably in simplifying the probability calculations.
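The probability density function of a normal distribution with mean μ and variance σ² is given by
\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}, \qquad -\infty < x < \infty,
\]
so that
\[
P(a < X < b) = \int_{a}^{b} \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}\,dx .
\]
This definite integral is tedious to compute. To overcome this problem we standardize the
value and use the table of the standard normal distribution to compute the probabilities.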
Example: A cost accountant needs to forecast the unit cost of a product for the next year. He
notes that each unit of the product requires 10 labor hours and 5 kg of raw material. In
addition, each unit of the product is assigned an overhead cost of Rs 200. He estimates that
the cost of a labor hour next year will be normally distributed with an expected value of Rs 45
and a standard deviation of Rs 2; the cost of raw material will be normally distributed with an
expected value of Rs 60 and a standard deviation of Rs 3. Find the distribution of the unit cost
of the product. Find its expected value and variance.
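Solution: Since the cost of labor L may not influence the cost of raw material M, we can
assume that the two are independent. The unit cost of the product, Q, is then a linear
combination of independent normal random variables and is itself a normally distributed random
variable. So if
Q = 10L + 5M + 200
then
E(Q) = 10E(L) + 5E(M) + 200
= 10(45) + 5(60) + 200
= 950
V(Q) = 10²V(L) + 5²V(M)
= 100(4) + 25(9)
= 625
So Q ~ N(950, 25²)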
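The mean and variance of the unit cost follow from the rules for a linear combination of independent random variables, which can be sketched in Python as below. The function name and argument layout are assumptions made for this illustration.

```python
from math import sqrt

def linear_combination_moments(coefficients, means, std_devs, constant=0):
    """Mean and variance of constant + sum(c_i * X_i) for independent X_i."""
    mean = constant + sum(c * m for c, m in zip(coefficients, means))
    variance = sum(c * c * s * s for c, s in zip(coefficients, std_devs))
    return mean, variance

# Q = 10 L + 5 M + 200, with L ~ N(45, 2^2) and M ~ N(60, 3^2)
mean_q, var_q = linear_combination_moments(
    coefficients=[10, 5], means=[45, 60], std_devs=[2, 3], constant=200
)
print(mean_q, var_q, sqrt(var_q))  # 950 625 25.0
```

Note that the constant overhead of Rs 200 shifts the mean but contributes nothing to the variance.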
The standard normal distribution is a normal distribution with a mean of 0 (zero) and a
standard deviation of 1. Its density function is
\[
f(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^{2}/2}
\]
All normally distributed variables can be transformed into the standard normally distributed
variable by using the formula for the standard score:
\[
Z = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \quad \text{or} \quad Z = \frac{X - \mu}{\sigma}
\]
The probability that a value x lies between two values a and b is given by the corresponding
area under the standard normal distribution.
Procedure to find the area under the standard normal distribution curve
1. Between 0 and any Z value: look up the Z value in the table to get the area.
2. In any tail:
Procedure
Example 1: Find the area under the standard normal distribution which lies.
P (0<Z<0.96) =?
P (0<Z<0.96) =0.3315
P (-1.45<Z<0) =?
=0.4265
= P (0<Z<0.35) + 0.5
= 0.1368 + 0.5
= 0.6368
= 1- 0.6368
= 0.3632
Example 2: Find the area under the standard normal curve which lies
P (-0.67 <Z<0.75) =?
=P (-0.67<Z<0) + P (0<Z<0.75)
=0.2486 + 0.2734
=0.522
P (2.13 <z<2.94) =?
=p (0<z<2.94) - p (0<z<2.13)
=0.4984-0.4834=0.015
Example 3: Find z if
P (0<Z<z) = 0.4726
z=? z=1.92
P (0<Z<z) =0.4868
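The table look-ups in Examples 1 to 3 can be cross-checked in Python using the error function from the standard library, since P(Z ≤ z) = ½[1 + erf(z/√2)]. The function names below are illustrative, and the inverse look-up uses simple bisection rather than a table. Small differences in the fourth decimal place compared with the worked answers come from the rounding of table entries.

```python
from math import erf, sqrt

def phi(z):
    """Cumulative probability P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def area_between(a, b):
    """P(a < Z < b): area under the standard normal curve between a and b."""
    return phi(b) - phi(a)

def z_for_area_from_zero(area, tol=1e-8):
    """Find z > 0 with P(0 < Z < z) = area by bisection (0 < area < 0.5)."""
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if area_between(0.0, mid) < area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(round(area_between(0.0, 0.96), 4))    # 0.3315
print(round(area_between(-1.45, 0.0), 4))   # 0.4265
print(round(area_between(-0.67, 0.75), 4))
print(round(area_between(2.13, 2.94), 4))
print(round(z_for_area_from_zero(0.4726), 2))
print(round(z_for_area_from_zero(0.4868), 2))
```

The last two lines perform the reverse look-up of Example 3: given an area measured from 0, they return the z value that bounds it.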
The importance of the standard normal distribution derives from the fact that
any normal random variable may be transformed to the standard normal
random variable. If we want to transform X, where X ~ N(μ, σ²), into the
standard normal random variable Z ~ N(0, 1), we can do this as follows:
\[
Z = \frac{X - \mu}{\sigma}
\]
Note: The table gives the areas between 0 and any z value to the right of 0, and
all areas are positive.
- Given a normally distributed random variable X with mean μ and standard deviation σ, the
probability that X lies between two values a and b is given by
\[
P(a < X < b) = P\!\left(\frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma}\right)
\]
Example: If X ~ N (50, 10 2), find the probability that the value of the random
variable X will be greater than 60.
Solution:
P(X > 60) = P( (X − 50)/10 > (60 − 50)/10 )
= P(Z > 1)
= 0.5000 - 0.3413
= 0.1587
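The same probability can be obtained without a table by using `statistics.NormalDist` from the Python standard library (available from Python 3.8). The sketch shows both the direct calculation and the standardized one; the variable names are illustrative.

```python
from statistics import NormalDist

# X ~ N(50, 10^2): probability that X exceeds 60
X = NormalDist(mu=50, sigma=10)
p_direct = 1 - X.cdf(60)

# Same probability after standardizing: z = (60 - 50) / 10 = 1
z = (60 - X.mean) / X.stdev
p_standardized = 1 - NormalDist().cdf(z)

print(round(p_direct, 4))        # 0.1587
print(round(p_standardized, 4))  # 0.1587
```

Both routes give the same answer, which is the point of standardizing: every normal probability reduces to a standard normal one.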
a. between Rs 70 and Rs 71
b. between Rs 69 and Rs 73
c. more than Rs 72 (d) less than Rs 65
X ~ N (70, )
=P( )
=P( )
=P( > )
=P( < )
= P(Z < -1.0)
= P(Z > 1.0)
= P(Z > 0) - P(0 < Z < 1.0)
= 0.5 - 0.3413
= 0.1587
So the number of workers whose weekly wages are less than Rs 65
= 2000 x 0.1587
≈ 317
3. Exponential Distribution
Definition 9.4:
Remark
A key property possessed only by exponential random variables is that they are
memoryless, in the sense that, for positive s and t, P{X >s + t|X >t} = P{X >s}.
If X represents the life of an item, then the memoryless property states that, for
any t, the remaining life of a t-year-old item has the same probability
distribution as the life of a new item. Thus, one need not remember the age of
an item to know its distribution of remaining life.
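The memoryless property can be verified numerically from the exponential survival function P(X > x) = e^(−λx). In the sketch below the rate λ = 3 and the values of s and t are illustrative assumptions, and the function names are hypothetical.

```python
from math import exp, isclose

def survival(x, rate):
    """P(X > x) for an exponential random variable with the given rate."""
    return exp(-rate * x)

def conditional_survival(s, t, rate):
    """P(X > s + t | X > t) = P(X > s + t) / P(X > t)."""
    return survival(s + t, rate) / survival(t, rate)

rate, s, t = 3.0, 0.5, 2.0   # illustrative values only
left = conditional_survival(s, t, rate)
right = survival(s, rate)
print(left, right, isclose(left, right))
```

Whatever value of t is used, the conditional probability equals P(X > s), so the age of the item carries no information about its remaining life.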
Solution: This distribution is exponential, and its mean and variance are obtained
as follows: E(X) = 1/λ = 1/3 and V(X) = E(X²) − [E(X)]² = 1/λ²
= 1/9.
ACTIVITY 9.3:
SUMMARY
SELF-ASSESSMENT QUESTIONS
Explain this statement. Also develop and generalize Binomial probability rule
with the help of an example.
Develop the Poisson probability rule from the Binomial probability rule under
these conditions.
5. List some of the important areas where Poisson distribution is used. Also
state the important properties of a Poisson distribution.
Out of 200 samples of 4 items, find the expected number of samples with (a),
(b), and (c) Above
8. The mean and variance of a binomial distribution are 2 and 1.5 respectively.
Find the probability of
9. 150 random samples of 4 units each are inspected for number of defective
item. The results are: Number of defective items: 0 1 2 3 4
Number of Samples: 28 62 46 10 4
0.002. Find the probability that out of 1000 individuals (a) no, (b) 1, (c) at least
1, and (d) at most 2 individuals will have a reaction from the injection.
11. In a razor blades manufacturing factory, there is small chance of 1/500 for
any blade to be defective. The blades are supplied in packets of 10. Find the
approximate number of packets containing (a) no, (b) 1, and (c) 2 defective
blades in a consignment of 10,000 packets.
EXERCISE
7. A normal distribution has mean 62.4, find its standard deviation if 20.05% of
the area under the normal curve lies to the right of 72.9
8. A random variable has a normal distribution with standard deviation 5. Find
its mean if the probability that the random variable will assume a value less
than 52.5 is 0.6915
1. The following present a list of different attributes and rules for assigning
numbers to objects. Try to classify the different measurement systems into one
of the four types of scales.
a. The order in which you were eliminated in a spelling bee as a measure of your
spelling ability.
b. Socioeconomic status of a family when classified as low, middle and upper
classes.
2. What are the major limitations of Statistics? Explain with suitable examples.
3. The accompanying data describe the hourly wage rates (dollars per hour) for
30 employees of an electronics firm:
22.66 24.39 17.31 21.02 21.61 20.97 18.58 16.61
19.74 21.57 20.56 22.16 20.16 18.97 22.64 19.62
22.05 22.03 17.09 24.60 23.82 17.80 16.28 19.34
22.22 19.49 22.27 18.20 19.29 20.43
4. If a permutation of the word WHITE is selected at random, how many of the
permutations
i. Begin with a consonant?
ii. End with a vowel?
iii. Have consonants and vowels alternating?