Chapter 1-4
INTRODUCTION
The word “statistics” could be singular or plural. The definition given in the second
place above might be taken as the singular form of “statistics”.
Statistics, in its singular sense, is a subject area or field of study. It is defined as
the science that deals with the collection, processing, analysis, interpretation and
presentation of numerical facts.
The subject of statistics is not a new discipline; it is as old as human society
itself. The sphere of its utility, however, was for a long time very restricted.
The word “statistics” is derived from the Latin for “state”, indicating the historical
importance of governmental data gathering, which related to demographic
information (military recruitment and tax collecting). Thus, the scope of statistics in
ancient times was primarily limited to the collection of demographic, property
and wealth data of a country by governments for framing military and fiscal policies.
Nowadays, statistics is used in almost every field of study, such as natural science,
social science, engineering, medicine, agriculture, etc.
Classification:
Statistics is broadly divided into two categories based on how the collected data
are used.
1. Descriptive Statistics
- deals with describing data without attempting to infer anything that goes beyond the
given set of data;
- consists of collection, organization, summarization and presentation of data.
2. Inferential Statistics
- deals with making inferences and/or conclusions about a population based on data
obtained from a limited sample of observations;
- consists of performing hypothesis testing, determining relationships among
variables and making predictions.
Examples:
a) From past figures, it has been predicted that 31% of registered voters will vote in the
November election.
b) The average age of a student in Hawassa University is 20.1 years.
c) To determine the most effective dose of a new medication (on the basis of tests
performed with volunteer patients from selected hospitals)
d) To compare the effectiveness of two reducing diets (based on the weight losses of
persons who were taking the diets)
e) There is a relationship between smoking tobacco and an increased risk of developing
cancer.
Statistics can be applied in any field of study which seeks quantitative evidence. For
instance (in engineering)
A variable is a characteristic of an object that can have different possible values.
There are two types of variables.
1. Nominal scale: “Nominal” comes from the Latin word for “name”. This is a scale for
grouping individuals into different categories.
Examples: red, brown, black; short, tall; pass, fail
In this scale, one category is simply different from another.
Arithmetic operations (+, -, *, /) are impossible, and so is comparison (ordering) of the categories.
2. Ordinal scale: “Ordinal” comes from the Latin word meaning “order”.
In this measurement scale
One value is different from, and better/greater than, another by a definite amount of difference
(addition and subtraction are possible, but multiplication and division are not).
37 °C – 35 °C = 2 °C
45 °C – 43 °C = 2 °C
40 °C = 2 × (20 °C), but this does not imply that an object at 40 °C is twice as hot
as an object at 20 °C, as the same temperatures in Fahrenheit show (°F = 9/5 × °C + 32):
40 °C → 9/5 × 40 + 32 = 104 °F
20 °C → 9/5 × 20 + 32 = 68 °F
°C = 5/9 × (°F – 32)
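The temperature argument above can be checked numerically. The following is a minimal sketch; the function and variable names are illustrative and not part of the original notes. It shows that differences keep their meaning when the unit changes, while ratios do not.

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit: F = 9/5 * C + 32."""
    return c * 9 / 5 + 32


hot_c, cool_c = 40, 20
hot_f = celsius_to_fahrenheit(hot_c)    # 104.0
cool_f = celsius_to_fahrenheit(cool_c)  # 68.0

# The ratio depends on the unit, so "twice as hot" is not meaningful.
ratio_c = hot_c / cool_c   # 2.0 on the Celsius scale
ratio_f = hot_f / cool_f   # about 1.53 on the Fahrenheit scale

# Equal differences in Celsius stay equal after conversion to Fahrenheit.
diff_1 = celsius_to_fahrenheit(37) - celsius_to_fahrenheit(35)
diff_2 = celsius_to_fahrenheit(45) - celsius_to_fahrenheit(43)

print(f"Ratio in Celsius: {ratio_c:.2f}, ratio in Fahrenheit: {ratio_f:.2f}")
print(f"Differences in Fahrenheit: {diff_1:.1f} and {diff_2:.1f}")
```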
Any scientific investigation requires data related to the study. The required data can
be obtained from either a primary source or a secondary source.
Primary source: a source of data that supplies first-hand information for
the immediate purpose.
Primary data: data originally collected for the immediate purpose.
- Primary data are more expensive to obtain than secondary data.
Secondary source: individuals or agencies that supply data originally
collected for other purposes by them or by others.
- Usually they are published or unpublished materials, records, reports,
etc.
Secondary data: data collected from a secondary source.
The process of data collection from a primary source may involve:
a) Field trials
b) Laboratory experiments
c) Surveys – census survey
           – sample survey
Chapter Two
Organization and Methods of Data Presentation
2.1 Classification and Tabulation of Data
Classification: is the process of arranging items/data into classes or categories
according to their similarities and/or differences.
Classification eliminates inconsistency and also brings out the points of similarity
and/or dissimilarity of collected items/data.
Classification is necessary because it is not possible to draw inferences and
conclusions directly from a large set of collected [raw] data.
A frequency distribution is a table that presents data according to some criteria with
the corresponding number of items falling in each class (i.e. with the corresponding
frequencies.)
Generally, there are two basic types of frequency distributions: Ungrouped and
Grouped frequency distributions.
Example: The following data are the ages in years of 20 women who attended health
education last year: 30, 41, 39, 41, 32, 29, 35, 31, 30, 36, 33, 36, 32, 42, 30, 35, 37,
32, 30, and 41.
Construct a frequency distribution for these data.
STEP 1. Find the range of the data: Range = 42 – 29 = 13.
STEP 2. Construct a table, tally the data and complete the frequency column. The
frequency distribution becomes as follows.
Age   Tally   Frequency
29    /       1
30    ////    4
31    /       1
32    ///     3
33    /       1
35    //      2
36    //      2
37    /       1
39    /       1
41    ///     3
42    /       1
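The tallying in the example above can be reproduced in a few lines. This is a small sketch using Python's standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# Ages (in years) of the 20 women, as listed in the example.
ages = [30, 41, 39, 41, 32, 29, 35, 31, 30, 36,
        33, 36, 32, 42, 30, 35, 37, 32, 30, 41]

data_range = max(ages) - min(ages)   # STEP 1: range of the data
frequency = Counter(ages)            # STEP 2: count each distinct age

print(f"Range = {data_range}")
print(f"{'Age':<5} {'Tally':<7} {'Frequency'}")
for age in sorted(frequency):
    count = frequency[age]
    print(f"{age:<5} {'/' * count:<7} {count}")
```

The frequencies add up to 20, the number of observations, which is a quick check that no value was missed while tallying.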
A tabular arrangement of class intervals together with their corresponding
cumulative frequency (either less than or more than type; as defined above) is called a
cumulative frequency distribution.
Relative frequency: the frequency of a class divided by the total frequency (i.e. the sum
of all frequencies); if multiplied by 100, it gives the percent of values falling in
that class.
Note:
- The relative frequency shows what fractional part or proportion of the total
frequency belongs to the corresponding class.
- The sum of all the relative frequencies in the frequency distribution is always
1.
Relative cumulative frequency (less than type/more than type): the total of the
relative frequencies above/below a class inclusively, or the cumulative frequency
(less than type/more than type) divided by the total frequency. Multiplied by 100, this gives the
percent of values which are less than/more than the upper/lower class boundary.
27 53 40 29 63 34 44 32
58 61 38 41 26 50 47 37
Construct a suitable frequency distribution for these data using 8 classes.
STEP 1. Unit of measurement: U = 1 year
STEP 2. Max = 65, Min = 26, so that R = 65 – 26 = 39
STEP 3. It is already determined to construct a frequency distribution having 8 classes.
Class limits: 26 – 30, 31 – 35, 36 – 40, 41 – 45, 46 – 50, 51 – 55, 56 – 60, 61 – 65
STEP 7. By subtracting 0.5 units of measurement from the lower class limits and by
adding 0.5 units of measurement to the upper class limits, we get the lower and
upper class boundaries as follows.
Class boundaries: 25.5 – 30.5, 30.5 – 35.5, 35.5 – 40.5, 40.5 – 45.5, 45.5 – 50.5, 50.5 – 55.5, 55.5 – 60.5, 60.5 – 65.5
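The class limits and class boundaries listed above can be generated programmatically. The sketch below assumes that the class width is obtained by rounding the ratio R / (number of classes) up to the next whole unit of measurement; the intermediate steps are not shown in these notes, so this rule is an assumption, although it does reproduce the limits given.

```python
import math

minimum, maximum = 26, 65   # Min and Max from STEP 2
num_classes = 8             # fixed in STEP 3
unit = 1                    # unit of measurement (1 year) from STEP 1

data_range = maximum - minimum                       # R = 39
# Assumed rule: round R / k up to the next whole unit (39 / 8 = 4.875 -> 5).
class_width = math.ceil(data_range / num_classes)

class_limits = []
for k in range(num_classes):
    lower = minimum + k * class_width
    upper = lower + class_width - unit
    class_limits.append((lower, upper))

# STEP 7: widen each class by half a unit on both sides.
class_boundaries = [(lo - 0.5 * unit, hi + 0.5 * unit) for lo, hi in class_limits]

for limits, bounds in zip(class_limits, class_boundaries):
    print(f"{limits[0]} - {limits[1]}    {bounds[0]} - {bounds[1]}")
```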
STEPS 8, 9 and 10 are displayed in the following table (columns 3, 4, and 5 and 6
respectively).
Class limits   Class boundaries   Tally        Frequency   Cumulative frequency   Cumulative frequency
                                                           (less than type)       (more than type)
26 – 30        25.5 – 30.5        /////        5           5                      40
31 – 35        30.5 – 35.5        /////        5           10                     35
36 – 40        35.5 – 40.5        /////        5           15                     30
41 – 45        40.5 – 45.5        ///// ////   9           24                     25
46 – 50        45.5 – 50.5        ///// //     7           31                     16
51 – 55        50.5 – 55.5        /            1           32                     9
56 – 60        55.5 – 60.5        //           2           34                     8
61 – 65        60.5 – 65.5        ///// /      6           40                     6
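The two cumulative frequency columns, and the relative frequencies defined earlier, follow directly from the frequency column. A minimal sketch using the frequencies in the table above (variable names are illustrative):

```python
from itertools import accumulate

# Class frequencies from the table, lowest class first.
frequencies = [5, 5, 5, 9, 7, 1, 2, 6]
total = sum(frequencies)

# Less than type: running total from the lowest class downwards in the table.
less_than = list(accumulate(frequencies))
# More than type: running total from the highest class upwards in the table.
more_than = list(accumulate(reversed(frequencies)))[::-1]

relative = [f / total for f in frequencies]
relative_less_than = [c / total for c in less_than]

for f, lt, mt, rf in zip(frequencies, less_than, more_than, relative):
    print(f"{f:>3} {lt:>4} {mt:>4} {rf:>7.3f}")
```

The last entry of the less than column and the first entry of the more than column both equal the total frequency, and the relative frequencies sum to 1.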
Usually diagrams are appropriate for presenting discrete data, whereas graphs are
appropriate for presenting continuous types of data.
2. Deviation bar-diagrams
When the data take both positive and negative values (for instance data on profit, net
export, percent change, etc) deviation bar-diagrams are appropriate.
The deviation bar-diagram for the data looks like the following.
3. Broken bar-diagrams
This kind of bar-diagram is used to present data involving a few extreme values where
it will be difficult to accommodate the magnitude of the bars corresponding to these
values within the graph paper. In this case we use pieces of bars with each piece
starting with a jump on the numerical scale.
Example: Data: amount of production per day for four products of a factory.
Product                      A    B    C    D
Quantity produced (kg/day)   14   35   23   109
4. Component bar-diagrams
When it is desired to show how a total (an aggregate) is divided into component parts,
we use a component bar-diagram. In this type of bar-diagram, the bars represent the
aggregate value of a variable, with each aggregate broken into its component parts;
different colors or designs are used for identification.
Example: Represent the following data using bar-charts.
Data: Yields of production of farmers in Southern Ethiopia.
Crop \ Year   1990 EC   1991 EC   1992 EC   1993 EC
Barley        14        15        26        19
Wheat         10        15        14        25
Maize         2         6         10        3
Total         26        36        50        47
5. Multiple bar-diagrams
Multiple bar-diagrams are used to display data on more than one variable. They are
used for comparing different variables at the same time.
Example: The data given in the above example can be presented by using multiple
bar-diagram as below.
II. Pie-charts
A pie-chart is a circle that is divided into sections or wedges according to the
percentages of frequencies in each category of the distribution. The angle of the sector
of a class is obtained by multiplying the ratio of the frequency of the class to the total
frequency by 360°.
Note that pie-charts are usually used for depicting nominal level data.
Example: A survey showed that a car owner spends birr 2,950 per year on operating
expenses. Below is the breakdown of the various expenditure items. Draw an
appropriate chart to portray the data.
Expenditure item        Amount (in birr)
Fuel                    603
Interest on car loan    279
Repairs                 930
Insurance and license   646
Depreciation            492
Total                   2,950
How to draw a pie-chart
First find the percentages of each class
Next calculate the degree measures for each class
Finally, using a protractor, put each sector /degree measure/ in a circle and give
a key for explanation.
[Figure: pie-chart of the car owner's operating expenses. Key: Fuel, Insurance and license, Repairs, Interest on car loan, Depreciation]
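The first two steps of drawing the pie-chart (percentages and degree measures) can be computed as follows. This is a small sketch based on the expenditure table of the example; the dictionary and variable names are illustrative.

```python
# Yearly operating expenses of the car owner, in birr.
expenses = {
    "Fuel": 603,
    "Interest on car loan": 279,
    "Repairs": 930,
    "Insurance and license": 646,
    "Depreciation": 492,
}
total = sum(expenses.values())

# For each item: (percentage of the total, angle of the sector in degrees).
sectors = {
    item: (100 * amount / total, 360 * amount / total)
    for item, amount in expenses.items()
}

for item, (percent, angle) in sectors.items():
    print(f"{item:<22} {percent:6.2f} %  {angle:7.2f} degrees")
```

The percentages should add up to 100 and the angles to 360, which is a useful check before measuring the sectors with a protractor.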
III. Pictograms
In pictograms, we represent the data by means of some picture symbols. Here we
decide a suitable picture to represent a definite number of units in which the variable
is measured.
Example: Draw a pictorial diagram to present the following data (number of students
in a certain school for four years.)
Let a single picture represent one thousand students.
[Pictogram: one row of picture symbols for each of the years 1992, 1993, 1994 and 1995. Key: one picture = 1000 students]
IV. Histogram
A histogram is another way of data presentation which is more suitable for
frequency distributions with continuous classes.
In drawing a histogram, we put the class boundaries of each class on the horizontal
axis and its respective frequency on the vertical axis.
V. Frequency Polygon
A frequency polygon is a line graph drawn by taking the frequencies of the classes
along the vertical axis and their respective class marks along the horizontal axis. The
plotted points are then joined by straight line segments.
Example: Present the data in the previous example using a frequency polygon.
Example: the data in the previous example can be presented using either a less than or
a more than cumulative frequency polygon as given below (i) and (ii) respectively.
(i) Less than type cumulative frequency polygon
Chapter Three
MEASURES OF CENTRAL TENDENCY
For instance, a data set consisting of six measurements 21, 13, 54, 46, 32 and 37 is
represented by $x_1, x_2, x_3, x_4, x_5$ and $x_6$, where $x_1 = 21$, $x_2 = 13$, $x_3 = 54$, $x_4 = 46$,
$x_5 = 32$ and $x_6 = 37$.
Similarly, =
3. where and are constant numbers
4.
Example 1: You measure the body lengths (in inches) of 10 full-term infants at birth
and record the following:
17.5 19.5 17.5 19 20
21 18 19.5 18 10.75
Compute the sample mean length of the infants for these data.
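As a quick check of Example 1, the sample mean is the sum of the observations divided by their number. A minimal sketch, using the ten lengths exactly as they are listed above:

```python
# Body lengths (in inches) of the 10 infants, as recorded in the example.
lengths = [17.5, 19.5, 17.5, 19, 20, 21, 18, 19.5, 18, 10.75]

n = len(lengths)
sample_mean = sum(lengths) / n   # 180.75 / 10

print(f"n = {n}, sum = {sum(lengths)}, sample mean = {sample_mean:.3f} inches")
```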
Example 2: Monthly incomes of fourth year regular students are given in the
following frequency distribution.
Example: The following table gives the daily wages of laborers. Calculate the average
daily wages paid to a laborer.
Wages in birr        11-13   13-15   15-17   17-19   19-21   21-23   23-25
Number of laborers   3       4       5       6       6       4       3
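For grouped data such as the wage table above, the mean is usually approximated by letting the class mark (midpoint) represent every observation in its class. The sketch below assumes that convention, since the formula itself is not reproduced in these notes.

```python
# Wage classes (in birr) and the number of laborers in each class.
classes = [(11, 13), (13, 15), (15, 17), (17, 19), (19, 21), (21, 23), (23, 25)]
frequencies = [3, 4, 5, 6, 6, 4, 3]

# Class mark = midpoint of the class.
class_marks = [(low + high) / 2 for low, high in classes]

n = sum(frequencies)
weighted_sum = sum(f * m for f, m in zip(frequencies, class_marks))
grouped_mean = weighted_sum / n   # 560 / 31

print(f"n = {n}, sum of f*m = {weighted_sum}, mean = {grouped_mean:.2f} birr")
```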
Properties of the Arithmetic Mean
The sum of the deviations of the items from their arithmetic mean is zero. This
means, the algebraic sum of the deviations of a set of numbers $x_1, x_2, \ldots, x_n$
from their mean $\bar{x}$ is zero.
That is, $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
The sum of the squares of the deviations of a set of observations from any
number, say A, is the least only when A = $\bar{x}$. That is,
$\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - A)^2$
When a set of observations is divided into k groups and $\bar{x}_1$ is the mean of the $n_1$
observations of group 1, $\bar{x}_2$ is the mean of the $n_2$ observations of group 2, …, $\bar{x}_k$ is the
mean of the $n_k$ observations of group k, then the combined mean, denoted by $\bar{x}_c$, of
all observations taken together is given by
$\bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k}$
If a wrong figure has been used in calculating the mean, we can correct the mean if we
know the correct figure that should have been used. Let
$x_{wr}$ denote the wrong figure used in calculating the mean,
$x_{c}$ be the correct figure that should have been used,
$\bar{x}_{wr}$ be the wrong mean calculated using $x_{wr}$; then the correct mean, $\bar{x}_{correct}$,
is given by
$\bar{x}_{correct} = \frac{n \bar{x}_{wr} - x_{wr} + x_{c}}{n}$
Solution:
Exercise: The average score on the mid-term examination of 25 students was 75.8 out
of 100. After the mid-term exam, however, a student whose score was 41 out of 100
dropped the course. What is the average/mean score among the 24 students?
Weighted Arithmetic Mean
In finding the arithmetic mean, all items were assumed to be of equal importance. When
proper importance is required to be given to different data, we find the weighted
average. Weights are
assigned to each item in proportion to its relative importance.
If $x_1, x_2, \ldots, x_n$ represent values of the items and $w_1, w_2, \ldots, w_n$ are the
corresponding weights, then the weighted mean, $\bar{x}_w$, is given by
$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Example: A student’s final marks in Mathematics, Physics, Chemistry and Biology are
respectively 82, 80, 90 and 70. If the respective credits received for these courses are
3, 5, 3 and 1, determine the approximate average mark the student has got for one
course.
Solution: We use a weighted arithmetic mean, the weight associated with each course
being taken as the number of credits received for the corresponding course.
Mark     82   80   90   70
Credit   3    5    3    1
Therefore, the weighted mean is (3 × 82 + 5 × 80 + 3 × 90 + 1 × 70) / (3 + 5 + 3 + 1) = 986 / 12 ≈ 82.17
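The weighted mean for this example can be verified with a short computation: each mark is multiplied by its credit, and the sum of these products is divided by the total number of credits. The helper name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


marks = [82, 80, 90, 70]   # Mathematics, Physics, Chemistry, Biology
credits = [3, 5, 3, 1]     # weights: credits of the corresponding courses

average_mark = weighted_mean(marks, credits)   # 986 / 12
simple_mean = sum(marks) / len(marks)          # unweighted, for comparison

print(f"Weighted mean = {average_mark:.2f}, simple mean = {simple_mean:.2f}")
```

The weighted mean (about 82.17) differs from the simple mean (80.5) because the courses with more credits carry more influence.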
For a grouped frequency distribution, the median is computed as
$\text{Median} = L + \left( \frac{n/2 - F}{f_m} \right) w$
where $L$ = lower class boundary of the median class,
$F$ = sum of frequencies of all classes lower than the median class (in other words, it
is the cumulative frequency preceding the median class),
$f_m$ = frequency of the median class, $w$ is the class width and $n$ is the total frequency.
The median class is the class with the smallest cumulative frequency greater than or equal
to $n/2$. It can be located by counting the frequencies beginning from the lowest class.
Example 1: The birth weights in pounds of five babies born in a hospital on a certain day
are 9.2, 6.4, 10.5, 8.1 and 7.8. Find the median weight of these five babies.
Solution: Arranged in increasing order, the weights are 6.4, 7.8, 8.1, 9.2 and 10.5. The middle (third) value is 8.1, so the median is 8.1 pounds.
Example 2: The following table gives the distribution of the weekly wages of employees
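The median of ungrouped data is found by sorting the values and taking the middle one (or the average of the two middle ones when the number of values is even). A minimal sketch applied to the birth weights of the example above; the function name is illustrative.

```python
def median(values):
    """Return the middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


birth_weights = [9.2, 6.4, 10.5, 8.1, 7.8]   # in pounds
median_weight = median(birth_weights)

print(f"Sorted weights: {sorted(birth_weights)}")
print(f"Median weight: {median_weight} pounds")
```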
of a small firm.
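For a grouped frequency distribution, the mode is computed as
$\text{Mode} = L + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) w$
where $L$ = lower class boundary of the modal class,
$\Delta_1$ = the difference between the frequency of the modal class and the next lower
class,
$\Delta_2$ = the difference between the frequency of the modal class and the next higher
class,
$w$ is the class width.
The modal class is the class with the highest frequency in the distribution.
Example 1: The marks obtained by ten students in a semester exam in statistics are: 70,
65, 68, 70, 75, 73, 80, 70, 83 and 86. Find the mode of the students’ marks.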
Example 2: Find the mode for the frequency distribution of the birth weight (in kilogram)
of 30 children given below.
Weight            1.9-2.3   2.3-2.7   2.7-3.1   3.1-3.5   3.5-3.9   3.9-4.3
No. of children   5         5         9         4         4         3
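Both mode examples can be worked through in code. The sketch below makes two assumptions that should be read with care: the grouped mode uses the standard interpolation formula Mode = L + Δ1 / (Δ1 + Δ2) × w, where L is the lower boundary of the modal class, and the first weight class is read as 1.9-2.3 so that all classes have the same width of 0.4 kg.

```python
from collections import Counter

# Example 1 (ungrouped data): the mode is the most frequent mark.
marks = [70, 65, 68, 70, 75, 73, 80, 70, 83, 86]
mark_mode, mark_count = Counter(marks).most_common(1)[0]

# Example 2 (grouped data). Assumption: the first class is 1.9-2.3.
classes = [(1.9, 2.3), (2.3, 2.7), (2.7, 3.1), (3.1, 3.5), (3.5, 3.9), (3.9, 4.3)]
frequencies = [5, 5, 9, 4, 4, 3]

i = frequencies.index(max(frequencies))   # position of the modal class
lower, upper = classes[i]
width = upper - lower

# Frequencies of the neighbouring classes (0 if the modal class is at an end).
previous_f = frequencies[i - 1] if i > 0 else 0
next_f = frequencies[i + 1] if i < len(frequencies) - 1 else 0

delta_1 = frequencies[i] - previous_f     # modal minus next lower class
delta_2 = frequencies[i] - next_f         # modal minus next higher class
grouped_mode = lower + delta_1 / (delta_1 + delta_2) * width

print(f"Mode of the marks: {mark_mode} (occurs {mark_count} times)")
print(f"Modal class: {lower}-{upper}, mode = {grouped_mode:.2f} kg")
```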
Merits of mode
- Mode is not affected by extreme values.
- Mode can be calculated even in the case of open-end intervals. And it is not necessary
to know all observations.
Demerits of mode
- Mode may not exist in the series and if it exists it may not be a unique value.
- It does not fulfill most of the requirements of a good measure of central tendency
- It may be unrepresentative in many cases.
3.3.4 Quantiles
Quantiles are values which divide a data set arranged in order of magnitude into
a certain number of equal parts. They are averages of position (non-central tendency). Some of these
quantiles are quartiles, deciles and percentiles.
I. Quartiles: are values which divide the data set into four equal parts, denoted by
$Q_1$, $Q_2$ and $Q_3$. The first quartile is also called the lower quartile and the third quartile is the
upper quartile. The second quartile is the median.
For Ungrouped data:
Let $Q_j$ be the $j$th quartile value for $j = 1, 2, 3$. Then
For Ungrouped data
Let $D_j$ be the $j$th decile value for $j = 1, 2, \ldots, 9$. Then
10 – 15 4620
15 – 20 5200
20 – 25 7250
25 – 30 620
30 – 35 297
35 - 40 355
Chapter Four
Measures of Dispersion (Variation)
4.1 Objectives of Measuring Variation
Variation (dispersion) is the scatter or spread of observations (values) in a distribution.
The average or central value is of little use unless the degree of variation, which
occurs about it, is given. If the scatter about the measure of central tendency is very
large, the average is not a typical value. Therefore, it is necessary to develop a
quantitative measure of the dispersion (or variation) of the values about the average.
Measures of variation are statistical measures, which provide ways of measuring the
extent to which the data are dispersed or spread out.
4.2 Absolute and Relative Measures of Dispersion
Measures of dispersion /variation may be either absolute or relative. Absolute
measures of dispersion are expressed in the same unit of measurement in which the
original data are given. These values may be used to compare the variation in two
distributions provided that the variables are in the same units and of the same average
size.
In case the two sets of data are expressed in different units, however, such as quintals
of sugar versus tonnes of sugarcane, or if the average sizes are very different, such as
a manager’s salary versus a worker’s salary, the absolute measures of dispersion are not
comparable. In such cases measures of relative dispersion should be used.
In the case of grouped data, the range is found by taking the difference between the class mark
of the last class and that of the first class. That is, $R = m_k - m_1$, where $m_k$ and $m_1$
are the class marks of the last class and of the first class respectively.
462 480 534 624 498 552 606 588 516 570
Solution:
Example 2: Find the values of the range and relative range for the following
frequency distribution, which shows the distribution of the maximum loads supported
by a certain number of cables.
Maximum load (in kilo-Newton)   Number of cables
93 – 97                         2
98 – 102                        5
103 – 107                       12
108 – 112                       17
113 – 117                       14
118 – 122                       6
123 – 127                       3
128 – 132                       1
Solution:
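A sketch of the computation for Example 2 follows. The range uses the class marks of the last and first classes, as described for grouped data. The relative range is taken here as (largest − smallest) / (largest + smallest); that definition is not reproduced in these notes, so it is an assumption, although it is the usual one.

```python
# Class limits of the maximum-load distribution (kilo-Newton).
class_limits = [(93, 97), (98, 102), (103, 107), (108, 112),
                (113, 117), (118, 122), (123, 127), (128, 132)]

# Class mark = midpoint of the class limits.
class_marks = [(low + high) / 2 for low, high in class_limits]

first_mark = class_marks[0]    # 95.0
last_mark = class_marks[-1]    # 130.0

value_range = last_mark - first_mark
# Assumed definition of the relative range (coefficient of range).
relative_range = value_range / (last_mark + first_mark)

print(f"Range = {value_range} kN, relative range = {relative_range:.4f}")
```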
The mean deviation about the arithmetic mean is, therefore, given by
$MD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$ for ungrouped data
$MD = \frac{\sum_{i=1}^{k} f_i |m_i - \bar{x}|}{n}$ for a grouped frequency distribution; where $\bar{x}$ is the
arithmetic mean, $m_i$ is the class mark of the $i$th class, $f_i$ is the frequency of the $i$th class
and $n = \sum_{i=1}^{k} f_i$.
The coefficient of mean deviation (CMD) is the ratio of the mean deviation of the
observations to their appropriate measure of central tendency: the arithmetic mean or
the median.
In general, $CMD = \frac{MD}{A}$, where A is a measure of central tendency: the arithmetic mean
or the median.
That is, CMD about the arithmetic mean is given by $CMD = \frac{MD}{\bar{x}}$, where MD is the
mean deviation calculated about the arithmetic mean. On the other hand, CMD about
the median is given by $CMD = \frac{MD}{\tilde{x}}$, in which case MD is calculated about the median $\tilde{x}$
of the observations.
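To illustrate the mean deviation and its coefficient, the sketch below reuses the ten ungrouped values listed earlier in this chapter (462, 480, ..., 570). It computes the mean deviation about the arithmetic mean as the average absolute deviation from the mean, and the CMD as that quantity divided by the mean; variable names are illustrative.

```python
# Ten ungrouped observations, taken from the data listed earlier in this chapter.
values = [462, 480, 534, 624, 498, 552, 606, 588, 516, 570]

n = len(values)
mean = sum(values) / n   # 5430 / 10 = 543.0

# Mean deviation about the arithmetic mean: average absolute deviation.
mean_deviation = sum(abs(x - mean) for x in values) / n

# Coefficient of mean deviation about the arithmetic mean.
cmd = mean_deviation / mean

print(f"Mean = {mean}, MD = {mean_deviation}, CMD = {cmd:.4f}")
```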
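Sample Variance ($s^2$)
For ungrouped data
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
For a grouped frequency distribution
$s^2 = \frac{\sum_{i=1}^{k} f_i (m_i - \bar{x})^2}{n - 1}$
where $\bar{x}$ is the sample
arithmetic mean, $m_i$ is the class mark of the $i$th class, $f_i$ is the frequency of the $i$th class
and $n = \sum_{i=1}^{k} f_i$.
The Standard Deviation
Standard deviation is the positive square root of the variance.
Population Standard Deviation ($\sigma$)
$\sigma = \sqrt{\sigma^2}$, where $\sigma^2$ is the population variance.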
Coefficient of Variation
The standard deviation is an absolute measure of dispersion. The corresponding
relative measure is known as the coefficient of variation (CV).
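The sample variance, the standard deviation and the coefficient of variation can be computed together. The sketch below reuses the ten ungrouped values listed earlier in this chapter and makes two assumptions that are standard but not spelled out in these notes: the sample variance divides by n − 1, and the CV is the standard deviation expressed as a percentage of the mean.

```python
import math

# Ten ungrouped observations, taken from the data listed earlier in this chapter.
values = [462, 480, 534, 624, 498, 552, 606, 588, 516, 570]

n = len(values)
mean = sum(values) / n

# Sample variance: sum of squared deviations divided by n - 1 (assumed convention).
sample_variance = sum((x - mean) ** 2 for x in values) / (n - 1)

# Standard deviation: positive square root of the variance.
sample_sd = math.sqrt(sample_variance)

# Coefficient of variation, assumed here to be (s / mean) * 100 percent.
cv_percent = sample_sd / mean * 100

print(f"Mean = {mean}, variance = {sample_variance}, SD = {sample_sd:.2f}")
print(f"CV = {cv_percent:.2f} %")
```

Because the CV is a pure number, it can be compared across data sets measured in different units or with very different means.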
Example: Last semester, the students of Biology and Chemistry Departments took
Stat 273 course. At the end of the semester, the following information was recorded.
Compare the relative dispersions of the two departments’ scores using the appropriate
way.
Solution:
Biology Department Chemistry Department
Properties of the Variance and the Standard Deviation
Variance
It removes most of the demerits or drawbacks of the measures of dispersion
discussed so far.
Its unit is the square of the unit of measurement of the values. For example, if the
variable is measured in kg, the unit of variance is kg².
It is calculated based on all the observations/data in the series.
It gives more weight to extreme values and less to those which are near to the mean.
Standard Deviation
It is considered to be the best measure of dispersion.
[Demerits] If the values of two series have different units of measurement, then we
cannot compare their variability just by comparing the values of their respective
standard deviations.
It is calculated based on all the observations/data in the series. Standard deviation is
capable of further algebraic treatment.
Standard deviation is as such neither easy to calculate nor to understand.
Similar to the variance, standard deviation gives more weight to extreme values and
less to those which are near to the mean.
Interpretation:
Example: Two sections were given an exam in a course. The average score was 72
with standard deviation of 6 for section 1 and 85 with standard deviation of 5 for
section 2. Student A from section 1 scored 84 and student B from section 2 scored 90.
Who performed better relative to his/her group?
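Solution: Section 1: $\bar{x}_1$ = 72, $s_1$ = 6 and score of student A from Section 1; $x_A$ = 84
Section 2: $\bar{x}_2$ = 85, $s_2$ = 5 and score of student B from Section 2; $x_B$ = 90
Z-score of student A: $z_A = \frac{x_A - \bar{x}_1}{s_1} = \frac{84 - 72}{6} = 2$
Z-score of student B: $z_B = \frac{x_B - \bar{x}_2}{s_2} = \frac{90 - 85}{5} = 1$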
From these two standard scores, we can conclude that student A has performed better
relative to his/her section because his/her score is two standard deviations
above the mean score of section 1, while the score of student B is only one standard
deviation above the mean score of section 2.
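The comparison in this example rests on the standard score, z = (score − mean) / standard deviation. A minimal sketch with the figures given for the two sections (the function name is illustrative):

```python
def z_score(score, mean, standard_deviation):
    """Number of standard deviations by which a score lies above (+) or below (-) the mean."""
    return (score - mean) / standard_deviation


z_a = z_score(84, mean=72, standard_deviation=6)   # student A, section 1
z_b = z_score(90, mean=85, standard_deviation=5)   # student B, section 2

better = "A" if z_a > z_b else "B"
print(f"z_A = {z_a}, z_B = {z_b}; student {better} did better relative to the group")
```

Although student B has the higher raw score (90 against 84), student A has the higher standard score, which is why A is judged to have performed better relative to the group.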