
Chapter 1-4

Uploaded by

Gizaw Fulas
Copyright
© © All Rights Reserved

CHAPTER ONE

INTRODUCTION

1.1 Definition and classification of statistics


Definition:
- Statistics is a collection of numerical facts and data.
- Statistics is a mathematical science dealing with the methods of collecting,
organizing, presenting, analyzing and interpreting data.
- Statistics is a subject that deals with numbers and figures describing certain
situations. It primarily deals with numerical data collected through surveys and
summarizes these data in such a way that the summary gives a good
indication of the nature of the data.

The word “statistics” can be singular or plural. The second definition given
above can be taken as the singular form of “statistics”.
Statistics, in its singular sense, is a subject area or field of study. It is defined as
the science that deals with the collection, processing, analysis, interpretation and
presentation of numerical facts.

Statistics is not a new discipline; it is as old as human society itself. In the past,
however, its sphere of utility was very restricted.

The word “statistics” is derived from the Latin for “state”, indicating the historical
importance of governmental data gathering, which related to demographic
information (for military recruitment and tax collection). Thus, in ancient times the
scope of statistics was primarily limited to the collection of demographic, property
and wealth data of a country by governments for framing military and fiscal policies.

Nowadays, statistics is used in almost every field of study, such as natural science,
social science, engineering, medicine, agriculture, etc.

Classification:
Statistics is broadly divided into two categories based on how the collected data
are used.
1. Descriptive Statistics
- deals with describing data without attempting to infer anything that goes beyond the
given set of data;
- consists of the collection, organization, summarization and presentation of data.

2. Inferential Statistics
- deals with making inferences and/or drawing conclusions about a population based on
data obtained from a limited sample of observations;
- consists of performing hypothesis tests, determining relationships among
variables and making predictions.

Examples:
a) From past figures, it has been predicted that 31% of registered voters will vote in the
November election.
b) The average age of a student at Hawassa University is 20.1 years.
c) To determine the most effective dose of a new medication (on the basis of tests
performed with volunteer patients from selected hospitals).
d) To compare the effectiveness of two reducing diets (based on the weight losses of
persons who were taking the diets).
e) There is a relationship between smoking tobacco and an increased risk of developing
cancer.

1.2 Definition of some basic terms

a) Population: the totality (collection) of all objects or items under
consideration. Examples: all university students of Ethiopia; all staff
members of Debub University.
b) Sample: a part of a population taken so that some generalization about the
population can be made. A sample should be representative of the
population.
c) Parameter: a descriptive measure of a population, or a summary value
calculated from a population. Examples: average, range, proportion,
variance, ...
d) Statistic: a descriptive measure of a sample, or a summary value
calculated from a sample.
Examples: average, range, proportion, variance, ...

1.3 Application and limitation of statistics

Statistics can be applied in any field of study which seeks quantitative evidence. For
instance (in engineering), it can be used:

- to compare the breaking strength of two types of materials;
- to determine the reliability of a product;
- to control the quality of products in a given production process;
- to compare the improvement of yield due to certain additives (fertilizers,
herbicides (weedicides), etc.).

However, statistics has the following limitations.

a) It does not study qualitative characteristics directly. Examples: beauty,
honesty, poverty, and standard of living.
b) It does not study a single individual but deals with aggregates of facts.
Example: the population size of a country for a single given year does not help
us in comparative studies.
c) Statistical results are true only on average. Examples: the probability of
getting a head in tossing a coin is 1/2; the germination percentage of a given
variety of seed is 80%.
d) It is open to misuse. Example: the number of car accidents caused
in a city in a particular year by women drivers is 10, while the number caused by
men drivers is 40. Concluding from these counts alone that women are safer drivers
is a misuse, because the counts ignore how many women and how many men drive.
The Indian tale of family members crossing a river is another good example.

1.4 Types of variables and measurement scales

A variable is a characteristic of an object that can take different possible values.
There are two types of variables.

a) Quantitative variables: variables that can be quantified, that is, that can have
numerical values. Examples: height, area, income, temperature, etc.
b) Qualitative variables: variables that cannot be quantified directly.
Examples: color, beauty, sex, location. Qualitative variables are also called
categorical variables. Hence we have two types of data: quantitative and
qualitative data.
Quantitative variables can be further classified as
- discrete variables, and
- continuous variables.
a) Discrete variables are variables whose values are counts.
Examples: number of students, number of households (family size), number
of pages of a book.
b) Continuous variables are variables that can take any value within an interval.
Examples: weight, length, volume, etc.

There are four types of measurement scales for variables.

1. Nominal scale: “nominal” comes from the Latin word for “name”. This is a scale for
grouping individuals into different categories.
Examples: red, brown, black; short, tall; pass, fail.
- On this scale, one category is merely different from another.
- The operations +, -, *, / are impossible, and comparison (ordering) is impossible.
2. Ordinal scale: “ordinal” comes from the Latin word meaning “order”.
- It is a scale for grouping and ordering individuals into
different categories.
- Data consisting of an ordering or ranking of measurements are
said to be on an ordinal scale of measurement.
Examples: faster, taller, shorter, military ranks, ranks in a race, etc.
- One category is different from, and greater/better/less than, the
other.
- The operations +, -, *, / are impossible; comparison is possible.
Further examples:
Man A weighs more than man B.
Ethiopian athletes got 1st and 2nd ranks in the 10,000 m women’s final in Sydney.
Ordinal scale data contain and convey more information than nominal scale data,
because relative magnitudes are known; however, quantitative comparisons are impossible.

3. Interval scale: a measurement scale in which
- there is no true zero point (the zero point is arbitrary);
- there is no physical significance to the zero point;
- there is a constant interval size between any adjacent units on the
measurement scale.
Examples: °C and °F (units for measuring temperature).
- On this measurement scale, one value is different from, and better/greater than,
another by a certain amount of difference. (It is possible to add and subtract, but
multiplication and division are not meaningful.)
37 °C - 35 °C = 2 °C
45 °C - 43 °C = 2 °C
40 °C = 2 × (20 °C), but this does not imply that an object at 40 °C is twice as hot
as an object at 20 °C. Converting with °F = 9/5 × °C + 32 changes the ratio:
40 °C → 9/5 × 40 + 32 = 104 °F
20 °C → 9/5 × 20 + 32 = 68 °F
Conversely, °C = 5/9 × (°F - 32):
60 °F → 5/9 × (60 - 32) = 15.56 °C
30 °F → 5/9 × (30 - 32) = -1.11 °C
- Interval scale data convey more information than nominal and ordinal scale
data.
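
The temperature arithmetic above can be checked with a short Python sketch. The function names `c_to_f` and `f_to_c` are illustrative choices, not part of the notes. The sketch shows that differences remain comparable across the two scales while ratios do not, which is why multiplication and division are not meaningful on an interval scale.

```python
def c_to_f(celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return 9 / 5 * celsius + 32


def f_to_c(fahrenheit):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return 5 / 9 * (fahrenheit - 32)


# The same two temperatures expressed on both scales.
hot_c, cold_c = 40, 20
hot_f, cold_f = c_to_f(hot_c), c_to_f(cold_c)

# Ratios depend on the arbitrary zero point, so they disagree between scales.
ratio_c = hot_c / cold_c
ratio_f = hot_f / cold_f

# Differences only change by the constant factor 9/5, so they stay comparable.
diff_c = hot_c - cold_c
diff_f = hot_f - cold_f

print(f"{hot_c} C = {hot_f:.0f} F, {cold_c} C = {cold_f:.0f} F")
print(f"ratio in C: {ratio_c:.2f}, ratio in F: {ratio_f:.2f}")
print(f"difference in C: {diff_c}, difference in F: {diff_f:.0f}")
print(f"60 F = {f_to_c(60):.2f} C, 30 F = {f_to_c(30):.2f} C")
```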

4. Ratio scale: a measurement scale in which
- there is a constant interval size between any adjacent units on the
measurement scale;
- there exists a true zero point on the measurement scale, and this zero point has
physical significance.
Examples: m, cm, kg, km/hr, cm/sec,
year, hour, second, K, m³, etc.
- One value is different from, and larger/taller/better/less than, another by a certain
amount of difference and by a certain number of times.
- The operations +, -, *, / are all possible on this scale.
- This measurement scale provides more information than the interval scale of
measurement.

1.5 Sources of data and methods of data collection


Not every aggregate of numbers can be called statistical data. We say an aggregate of
numbers is statistical data when the numbers are
- comparable,
- meaningful, and
- collected for a well-defined objective.
Raw data: collected data that have not been organized numerically.
Example: 25, 10, 32, 18, 6, 93, 4.
An array: an arrangement of raw numerical data in ascending or descending order
of magnitude.
- It enables us to find the range of the data set easily, and it also gives us some
idea about the general characteristics of the distribution.

Any scientific investigation requires data related to the study. The required data can
be obtained from either a primary source or a secondary source.

Primary source: a source of data that supplies first-hand information for the
immediate purpose.
- Primary data: data originally collected for the immediate purpose.
- Primary data are usually more expensive to obtain than secondary data.
Secondary sources: individuals or agencies that supply data originally
collected for other purposes by them or by others.
- Usually they are published or unpublished materials, records, reports, etc.
- Secondary data: data collected from a secondary source.
The process of data collection from a primary source may involve:

a) Field trials
b) Laboratory experiments
c) Surveys: census surveys and sample surveys

Chapter Two
Organization and Methods of Data Presentation
2.1 Classification and Tabulation of Data
Classification is the process of arranging items/data into classes or categories
according to their similarities and/or differences.

Classification eliminates inconsistency and also brings out the points of similarity
and/or dissimilarity of collected items/data.

Classification is necessary because it would not be possible to draw inferences and
conclusions directly from a large set of collected [raw] data.

2.2 Frequency Distributions


Frequency is the number of times a certain value or set of values occurs in a
specific group.

A frequency distribution is a table that presents data according to some criterion,
together with the corresponding number of items falling in each class (i.e. with the
corresponding frequencies).

Example: A frequency distribution presenting the number of males and females in a
class

Sex       Frequency
Male      57
Female    39

Generally, there are two basic types of frequency distributions: ungrouped and
grouped frequency distributions.

1. Ungrouped frequency distribution


An ungrouped frequency distribution is a table of all potential raw score values that
could possibly occur in the data, along with their corresponding frequencies.
It is often constructed for a small set of data or for a
discrete variable.

Constructing an ungrouped frequency distribution

To construct an ungrouped frequency distribution, first find the smallest and the
largest raw scores in the collected data. Then make a columnar table of all potential
raw score values arranged in order of magnitude, each with the number of times that
value is repeated, i.e., its frequency. To facilitate counting, tallies can be used.

Example: The following data are the ages in years of 20 women who attended health
education last year: 30, 41, 39, 41, 32, 29, 35, 31, 30, 36, 33, 36, 32, 42, 30, 35, 37,
32, 30, and 41.
Construct a frequency distribution for these data.
STEP 1. Find the range of the data: Range = Max - Min = 42 - 29 = 13.

STEP 2. Construct a table, tally the data and complete the frequency column. The
frequency distribution is as follows.

Age   Tally   Frequency
29    /       1
30    ////    4
31    /       1
32    ///     3
33    /       1
35    //      2
36    //      2
37    /       1
39    /       1
41    ///     3
42    /       1
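
As a rough cross-check of the tally, the same ungrouped frequency distribution can be produced with Python's standard `collections.Counter`. This is only a sketch; the variable names are illustrative.

```python
from collections import Counter

# Ages (in years) of the 20 women in the example.
ages = [30, 41, 39, 41, 32, 29, 35, 31, 30, 36,
        33, 36, 32, 42, 30, 35, 37, 32, 30, 41]

# STEP 1: the range is the largest value minus the smallest value.
data_range = max(ages) - min(ages)

# STEP 2: count how many times each distinct age occurs.
frequency = Counter(ages)

print("Range:", data_range)
print("Age  Tally  Frequency")
for age in sorted(frequency):
    tally = "/" * frequency[age]
    print(f"{age:<4} {tally:<6} {frequency[age]}")
```

Only ages that actually occur are listed, as in the table above; ages with zero frequency (34, 38 and 40) are omitted.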

2. Grouped frequency distribution


When the range of the data is large, the data must be grouped into classes. A grouped
frequency distribution is a frequency distribution in which several data values are
grouped into one class.

Some Important Definitions

- Raw data: data collected in original form.
- Array: data arranged in ascending or descending order.
- Class: one of the different, non-overlapping groups of data.
- Class limits: separate one class in a grouped frequency distribution from another.
The limits could actually appear in the collected data, and there is a gap between the
upper limit of one class and the lower limit of the next class.
- Class boundaries: separate one class in a grouped frequency distribution from
another. The boundaries have one more decimal place than the raw data and
therefore do not appear in the collected data. There is no gap between the upper
boundary of one class and the lower boundary of the next class. The lower class
boundary (LCB) is found by subtracting 0.5 units of measurement from the lower
class limit (LCL), and the upper class boundary (UCB) is found by adding 0.5
units of measurement to the upper class limit (UCL).
That is, $LCB = LCL - \tfrac{1}{2}U$ and $UCB = UCL + \tfrac{1}{2}U$.
- Class width (W): the difference between the upper and lower boundaries of any
class, or between the lower limits of two consecutive classes, or between the upper
limits of two consecutive classes.
N.B. Class width is not equal to the difference between the UCL and LCL of the same
class.
- Class mark (M): the midpoint of a class interval,
i.e. $M = \frac{LCL + UCL}{2} = \frac{LCB + UCB}{2}$.
- Unit of measurement (U): the smallest difference between any two values of the
variable being measured.
- Cumulative frequency (Cf), less than type: the total frequency of all values
(observations) less than or equal to the upper class boundary of the given class.
- Cumulative frequency (Cf), more than type: the total frequency of all values
(observations) greater than or equal to the lower class boundary of the given
class.

A tabular arrangement of class intervals together with their corresponding
cumulative frequencies (either less than or more than type, as defined above) is called a
cumulative frequency distribution.
- Relative frequency: the frequency of a class divided by the total frequency (i.e. the sum
of all frequencies); if multiplied by 100, it gives the percentage of values falling in
that class.

Note:
- The relative frequency shows what fractional part or proportion of the total
frequency belongs to the corresponding class.
- The sum of all the relative frequencies in a frequency distribution is always
1.
- Relative cumulative frequency (less than type/more than type): the total of the
relative frequencies up to and including a class (less than type) or from a class
onward (more than type). Equivalently, it is the cumulative frequency
(less than type/more than type) divided by the total frequency. This gives the
proportion of values which are less than/more than the upper/lower class boundary.
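
To make these definitions concrete, the following Python sketch computes relative and less-than-type relative cumulative frequencies. It reuses the age frequencies from the ungrouped example earlier in this chapter; the variable names are illustrative.

```python
from itertools import accumulate

# Frequencies of the ages from the ungrouped frequency distribution example.
frequency = {29: 1, 30: 4, 31: 1, 32: 3, 33: 1, 35: 2,
             36: 2, 37: 1, 39: 1, 41: 3, 42: 1}

total = sum(frequency.values())

# Relative frequency = class frequency / total frequency.
relative = {age: count / total for age, count in frequency.items()}

# Less-than-type cumulative frequency, then divided by the total frequency.
cumulative = list(accumulate(frequency.values()))
relative_cumulative = [c / total for c in cumulative]

print("Age  f  rel.f  percent  rel.cum.f")
for (age, count), rel_cum in zip(frequency.items(), relative_cumulative):
    rel = relative[age]
    print(f"{age:<4} {count:<2} {rel:.2f}   {100 * rel:>5.1f}%   {rel_cum:.2f}")
```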

Guidelines to construct a grouped frequency distribution


STEP 1. Determine the unit of measurement, U.
STEP 2. Find the maximum (Max) and the minimum (Min) observations, and then
compute their range, R = Max - Min.
STEP 3. Fix the number of classes desired (k). There are two ways to fix k:
– Fix k arbitrarily between 6 and 20, or
– Use Sturges’ formula: $k = 1 + 3.322 \log_{10} N$, where N is the total
frequency. Round this value of k up to get an integer.
STEP 4. Find the class width (W) by dividing the range by the number of classes,
$W = R/k$, and round the result up to get an integer value.
STEP 5. Pick a suitable starting point less than or equal to the minimum value. This
starting point is the lower limit of the first class. Continue to add the class
width to this lower limit to get the rest of the lower limits.
STEP 6. Find the upper class limits. To find the upper class limit of the first class,
subtract one unit of measurement from the lower limit of the second class.
Then continue to add the class width to this upper limit so as to get the rest of
the upper limits.
STEP 7. Compute the class boundaries as $LCB = LCL - \tfrac{1}{2}U$ and $UCB = UCL + \tfrac{1}{2}U$,
where LCL = lower class limit, UCL = upper class limit, LCB = lower class
boundary and UCB = upper class boundary. The class boundaries are also halfway
between the upper limit of one class and the lower limit of the next class.
STEP 8. Tally the data.
STEP 9. Find the frequencies.
STEP 10. (If necessary) Find the cumulative frequencies (more than and less than types).
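
Steps 3 and 4 can be sketched in Python as below. The sketch assumes the common textbook form of Sturges' formula, k = 1 + 3.322 log10(N), rounded up; the function names `sturges_classes` and `class_width` are hypothetical helpers introduced only for illustration.

```python
import math


def sturges_classes(total_frequency):
    """Number of classes from Sturges' formula (assumed form), rounded up."""
    return math.ceil(1 + 3.322 * math.log10(total_frequency))


def class_width(data_range, classes):
    """Class width: the range divided by the number of classes, rounded up."""
    return math.ceil(data_range / classes)


# For 40 observations with range 39 (hypothetical inputs for illustration).
k = sturges_classes(40)
print("classes suggested by Sturges' formula:", k)
print("class width for that k:", class_width(39, k))
print("class width if k is fixed at 8:", class_width(39, 8))
```

Rounding up rather than to the nearest integer is what guarantees that k classes of width W cover the whole range.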
Example: The numbers of hours 40 employees spent on their jobs during the last 7 working
days are given below.
62 50 35 36 31 43 43 43
41 31 65 30 41 58 49 41
37 62 27 47 65 50 45 48
27 53 40 29 63 34 44 32
58 61 38 41 26 50 47 37
Construct a suitable frequency distribution for these data using 8 classes.
STEP 1. Unit of measurement: U = 1 hour.
STEP 2. Max = 65, Min = 26, so that R = 65 - 26 = 39.
STEP 3. It is already determined to construct a frequency distribution having 8 classes.
STEP 4. Class width: W = R/k = 39/8 = 4.875, which is rounded up to W = 5.
STEP 5. Starting point = 26 = lower limit of the first class. Hence the lower class
limits become
26 31 36 41 46 51 56 61
STEP 6. Upper limit of the first class = 31 - 1 = 30. Hence the upper class limits
become
30 35 40 45 50 55 60 65
The lower and the upper class limits (Steps 5 and 6) can be written as follows.

Class limits: 26 – 30, 31 – 35, 36 – 40, 41 – 45, 46 – 50, 51 – 55, 56 – 60, 61 – 65

STEP 7. By subtracting 0.5 units of measurement from the lower class limits and by
adding 0.5 units of measurement to the upper class limits, we get the lower and
upper class boundaries as follows.
Class boundaries: 25.5 – 30.5, 30.5 – 35.5, 35.5 – 40.5, 40.5 – 45.5, 45.5 – 50.5, 50.5 – 55.5, 55.5 – 60.5, 60.5 – 65.5
STEPS 8, 9 and 10 are displayed in the following table (columns 3, 4, and 5 and 6
respectively).

Class limits   Class boundaries   Tally        Frequency   Cumulative frequency   Cumulative frequency
                                                           (less than type)       (more than type)
26 – 30        25.5 – 30.5        /////        5           5                      40
31 – 35        30.5 – 35.5        /////        5           10                     35
36 – 40        35.5 – 40.5        /////        5           15                     30
41 – 45        40.5 – 45.5        ///// ////   9           24                     25
46 – 50        45.5 – 50.5        ///// //     7           31                     16
51 – 55        50.5 – 55.5        /            1           32                     9
56 – 60        55.5 – 60.5        //           2           34                     8
61 – 65        60.5 – 65.5        ///// /      6           40                     6
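
The whole construction for this example can be reproduced with a short Python sketch. It assumes the choices made in the steps above (U = 1, W = 5, starting point 26, 8 classes); the variable names are illustrative.

```python
# Hours worked by the 40 employees in the example.
hours = [62, 50, 35, 36, 31, 43, 43, 43,
         41, 31, 65, 30, 41, 58, 49, 41,
         37, 62, 27, 47, 65, 50, 45, 48,
         27, 53, 40, 29, 63, 34, 44, 32,
         58, 61, 38, 41, 26, 50, 47, 37]

unit, width, start, classes = 1, 5, 26, 8

rows = []
less_than = 0            # running less-than-type cumulative frequency
more_than = len(hours)   # more-than-type cumulative frequency of the current class
for i in range(classes):
    lcl = start + i * width              # lower class limit (Step 5)
    ucl = lcl + width - unit             # upper class limit (Step 6)
    lcb, ucb = lcl - unit / 2, ucl + unit / 2   # class boundaries (Step 7)
    freq = sum(lcl <= h <= ucl for h in hours)  # Steps 8 and 9
    less_than += freq
    rows.append((lcl, ucl, lcb, ucb, freq, less_than, more_than))
    more_than -= freq

print("Limits   Boundaries    f   <cf  >cf")
for lcl, ucl, lcb, ucb, freq, lt, mt in rows:
    print(f"{lcl}-{ucl}    {lcb}-{ucb}   {freq:>2}  {lt:>3}  {mt:>3}")
```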

2.3 Diagrammatic and Graphic Presentation of Data


The data presented in a frequency distribution can also be displayed
diagrammatically or graphically.
Diagrams and graphs:
- are techniques for presenting data in visual displays using geometric figures;
- are visual aids which give a bird’s-eye view of a given set of numerical
data;
- have greater attraction than mere figures (numbers);
- facilitate comparison of data;
- are easily understandable even by people who have no statistical
background.

Usually diagrams are appropriate for presenting discrete data, whereas graphs are
appropriate for presenting continuous data.

There are three common diagrammatic presentations of data: bar-diagrams/charts,
pie-charts and pictograms, as well as three common graphic presentations of data:
histogram, frequency polygon, and cumulative frequency polygon (ogive).
I. Bar-diagrams/Bar-charts
- A bar-diagram is a series of equally spaced bars of equal width, with the height of
each bar representing the magnitude or frequency of observations in each group.
- Bar-diagrams are usually used to represent one-way or simple frequency
distributions.
- Bar-diagrams can be drawn either horizontally or vertically. Usually horizontal
bar-diagrams are used for qualitatively classified data, whereas vertical
bar-diagrams are used for quantitatively classified data.

Example: Horizontal bar-diagram.

There are a number of types of bar-diagrams. The most common are:

- Simple bar-diagrams
- Deviation (two-way) bar-diagrams
- Broken bar-diagrams
- Component (subdivided) bar-diagrams
- Multiple bar-diagrams
1. Simple bar-diagrams
Simple bar-diagrams are used to depict data on a single variable (one-way data).
Example: The following frequency distribution shows sales (in million
birr) of four products for the 2004 production year.

Product             A    B    C    D
Sale (in million)   14   21   9    17

The bar-diagram presentation for these data is given below.
2. Deviation bar-diagrams
When the data take both positive and negative values (for instance, data on profit, net
exports, percent change, etc.), deviation bar-diagrams are appropriate.

Example: Present the following data using a suitable bar-diagram.

Data: Net profit (in thousands of birr) in oil sales for five years

Year                    1997   1998   1999   2000   2001
Profit (in thousands)   12     -5     14     9      -6

The deviation bar-diagram for the data looks like the following.

3. Broken bar-diagrams
This kind of bar-diagram is used to present data involving a few extreme values, where
it would be difficult to accommodate the bars corresponding to these
values within the graph paper. In this case we use pieces of bars, with each piece
starting with a jump on the numerical scale.
Example: Data: Amount of production per day for four products of a factory.

Product                      A    B    C    D
Quantity produced (kg/day)   14   35   23   109

4. Component bar-diagrams
When it is desired to show how a total (an aggregate) is divided into component parts,
we use a component bar-diagram. In this type of bar-diagram, the bars represent the
aggregate value of a variable, with each aggregate broken into its component parts, and
different colors or designs are used for identification.

Example: Represent the following data using bar-charts.
Data: Yields of production of farmers in Southern Ethiopia.

Crop     1990 EC   1991 EC   1992 EC   1993 EC
Barley   14        15        26        19
Wheat    10        15        14        25
Maize    2         6         10        3
Total    26        36        50        47

The component bar-diagram for this table is as follows.

5. Multiple bar-diagrams
Multiple bar-diagrams are used to display data on more than one variable. They are
used for comparing different variables at the same time.

Example: The data given in the above example can be presented using a multiple
bar-diagram as below.

II. Pie-charts
A pie-chart is a circle that is divided into sections or wedges according to the
percentages of frequencies in each category of the distribution. The angle of the sector
of a class is obtained by multiplying the ratio of the frequency of the class to the total
frequency by 360°.

Note that pie-charts are usually used for depicting nominal level data.

Example: A survey showed that a car owner spends birr 2,950 per year on operating
expenses. Below is the breakdown of the various expenditure items. Draw an
appropriate chart to portray the data.

Expenditure item        Amount (in birr)
Fuel                    603
Interest on car loan    279
Repairs                 930
Insurance and license   646
Depreciation            492
Total                   2,950
How to draw a pie-chart
- First find the percentage for each class.
- Next calculate the degree measure for each class.
- Finally, using a protractor, mark each sector (degree measure) in a circle and give
a key for explanation.

Expenditure item        Amount (in birr)   Percentage (approx)   Degree (approx)
Fuel                    603                20                    74
Interest on car loan    279                9                     34
Repairs                 930                32                    113
Insurance and license   646                22                    79
Depreciation            492                17                    60
Total                   2,950              100                   360
Now we can draw the pie-chart for the data.

Key
Fuel
Insurance and license
Repairs
Interest on car loan
Depreciation
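
The percentage and degree columns of the pie-chart table can be recomputed with a few lines of Python. This is a sketch of the arithmetic only (it does not draw the chart), and the variable names are illustrative.

```python
# Yearly operating expenses of the car owner, in birr.
expenses = {
    "Fuel": 603,
    "Interest on car loan": 279,
    "Repairs": 930,
    "Insurance and license": 646,
    "Depreciation": 492,
}

total = sum(expenses.values())

rows = []
for item, amount in expenses.items():
    share = amount / total            # ratio of class frequency to total
    percentage = round(100 * share)   # approximate percentage
    degrees = round(360 * share)      # approximate sector angle in degrees
    rows.append((item, amount, percentage, degrees))

for item, amount, percentage, degrees in rows:
    print(f"{item:<22} {amount:>5} {percentage:>4}% {degrees:>4} deg")
print(f"{'Total':<22} {total:>5}")
```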

III. Pictograms
In pictograms, we represent the data by means of picture symbols. We
choose a suitable picture to represent a definite number of units in which the variable
is measured.
Example: Draw a pictorial diagram to present the following data (number of students
in a certain school for four years).

Year              1992   1993   1994   1995
No. of students   2000   3000   5000   7000

Let a single picture (*) represent one thousand students.

1995  * * * * * * *
1994  * * * * *          Key: * = 1000 students
1993  * * *
1992  * *

IV. Histogram
A histogram is another way of presenting data, one that is more suitable for
frequency distributions with continuous classes.
In drawing a histogram, we put the class boundaries of each class on the horizontal
axis and its respective frequency on the vertical axis.

Example: Draw a histogram presenting the following data.

Class boundaries   Class mark   Frequency   Cumulative frequency   Cumulative frequency
                                            (less than type)       (more than type)
5.5 – 11.5         8.5          2           2                      20
11.5 – 17.5        14.5         2           4                      18
17.5 – 23.5        20.5         7           11                     16
23.5 – 29.5        26.5         4           15                     9
29.5 – 35.5        32.5         3           18                     5
35.5 – 41.5        38.5         2           20                     2
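
The quantities plotted in a histogram, a frequency polygon and an ogive can all be derived from the class boundaries and frequencies alone. The following Python sketch recomputes the class marks and both cumulative frequency columns of the table and prints a rough text histogram; the variable names are illustrative.

```python
from itertools import accumulate

# Class boundaries and frequencies from the histogram example.
boundaries = [(5.5, 11.5), (11.5, 17.5), (17.5, 23.5),
              (23.5, 29.5), (29.5, 35.5), (35.5, 41.5)]
frequencies = [2, 2, 7, 4, 3, 2]

# Class mark = midpoint of the class (x-coordinates of a frequency polygon).
class_marks = [(lower + upper) / 2 for lower, upper in boundaries]

# Less-than type: running total of the frequencies (plotted at upper boundaries).
less_than = list(accumulate(frequencies))

# More-than type: observations in this class and all higher classes
# (plotted at lower boundaries).
total = sum(frequencies)
more_than = [total - cum + freq for cum, freq in zip(less_than, frequencies)]

# Rough text histogram: one '#' per observation in the class.
for (lower, upper), mark, freq in zip(boundaries, class_marks, frequencies):
    print(f"{lower:>4} - {upper:<4} (mark {mark:>4}) | {'#' * freq}")
```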

V. Frequency Polygon
A frequency polygon is a line graph drawn by taking the frequencies of the classes
along the vertical axis and their respective class marks along the horizontal axis. The
plotted points are then joined with straight lines.

Example: Present the data in the previous example using a frequency polygon.

VI. Cumulative Frequency Polygon (Ogive)

A cumulative frequency polygon can be drawn on a less than or a more than cumulative
frequency basis. Place the class boundaries along the horizontal axis and the
corresponding cumulative frequencies (either less than or more than cumulative
frequencies) along the vertical axis. Then join the plotted points.

Example: The data in the previous example can be presented using either a less than or
a more than cumulative frequency polygon, as given in (i) and (ii) below respectively.
(i) Less than type cumulative frequency polygon

(ii) More than type cumulative frequency polygon

Chapter Three
MEASURES OF CENTRAL TENDENCY

3.1 Objectives of Measuring Central Tendency


The most important aspect of studying the distribution of a sample of measurements is the
position of the central value, that is, a representative value about which the
measurements are distributed. It is often convenient to have one figure that is
representative of each group; this figure is known as the average of the group. If the
members of the group are arranged in order of magnitude, the averages tend to fall
around the central position in the group, so averages are called measures of central
tendency. In short, any measure intended to represent the center of a data set is called a
measure of location or central tendency.

The most important objectives of measuring central tendency are:

- to determine a single value around which the other data concentrate;
- to summarize/reduce the volume of the data;
- to facilitate comparison within one group or between groups of data.

Desirable properties of a measure of central tendency

A measure of central tendency is considered good if it possesses most of the following
properties. It should:
- be simple to understand and easy to calculate/interpret,
- exist and be unique,
- be rigidly defined by a mathematical formula,
- be based on all observations,
- not be seriously affected by extreme observations,
- be capable of further statistical analysis and/or algebraic manipulation.
3.2 The Summation Notation (∑)
Let a data set consist of a number of observations, represented by $x_1, x_2, \ldots, x_n$, where n
(the last subscript) denotes the number of observations in the data and $x_i$ is the ith
observation. Then the sum $x_1 + x_2 + \cdots + x_n$ is written as $\sum_{i=1}^{n} x_i$.

For instance, a data set consisting of six measurements 21, 13, 54, 46, 32 and 37 is
represented by $x_1, x_2, x_3, x_4, x_5$ and $x_6$, where $x_1 = 21$, $x_2 = 13$, $x_3 = 54$, $x_4 = 46$,
$x_5 = 32$ and $x_6 = 37$.

Their sum becomes $\sum_{i=1}^{6} x_i = 21 + 13 + 54 + 46 + 32 + 37 = 203$.

Similarly, =

Some Properties of the Summation Notation


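1. $\sum_{i=1}^{n} c = nc$, where c is a constant number.

2. $\sum_{i=1}^{n} b x_i = b \sum_{i=1}^{n} x_i$, where b is a constant number.

3. $\sum_{i=1}^{n} (a + b x_i) = na + b \sum_{i=1}^{n} x_i$, where a and b are constant numbers.

4. $\sum_{i=1}^{n} (x_i + y_i) = \sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i$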
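
The following Python sketch checks numerically the usual textbook properties of the summation notation: the sum of a constant, a constant multiple, a linear expression a + b·x, and a term-by-term sum of two data sets. The x values are the six measurements from the example above; the second data set `y` and the constants `a`, `b`, `c` are hypothetical values chosen only for illustration.

```python
x = [21, 13, 54, 46, 32, 37]   # the six measurements from the example
y = [5, 8, 2, 9, 4, 6]         # hypothetical second data set
a, b, c = 4, 3, 7              # hypothetical constants
n = len(x)

total_x = sum(x)
print("sum of x:", total_x)

# Property 1: summing a constant n times gives n * c.
assert sum(c for _ in range(n)) == n * c

# Property 2: a constant factor can be taken outside the sum.
assert sum(b * xi for xi in x) == b * total_x

# Property 3: sum of (a + b * x_i) equals n * a + b * sum of x_i.
assert sum(a + b * xi for xi in x) == n * a + b * total_x

# Property 4: the sum of term-by-term sums equals the sum of the sums.
assert sum(xi + yi for xi, yi in zip(x, y)) == total_x + sum(y)

print("all four properties hold for these values")
```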

3.3 Types of Measures of Central Tendency


Several types of averages or measures of central tendency can be defined. The most
common are
- the arithmetic mean (or simply the mean),
- the mode,
- the median.
The choice of average (measure of central tendency) depends upon which one best
represents the property under discussion.
3.3.1. The Arithmetic Mean (The Mean)
The arithmetic mean is defined as the sum of the measurements of the items divided by
the total number of items. For a sample $x_1, x_2, \ldots, x_n$,
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}.$$
Arithmetic Mean for Ungrouped Frequency Distribution
When the data are given in the form of an ungrouped frequency distribution, in which
the value $x_i$ occurs with frequency $f_i$ (i = 1, 2, …, k), the formula for the mean is
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}.$$

Example 1: You measure the body lengths (in inches) of 10 full-term infants at birth
and record the following:
17.5 19.5 17.5 19 20
21 18 19.5 18 10.75
Compute the sample mean length of the infants for these data.
Example 2: Monthly incomes of fourth-year regular students are given in the
following frequency distribution.

Monthly income (birr)   54.5   64.5   74.5   84.5   94.5   104.5   114.5
Number of students      6      9      15     25     13     7       5
Compute the mean for these data.
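
The two examples are left for the reader, but the calculations can be sketched in Python as follows. Example 1 uses the plain mean (sum divided by count); Example 2 weights each income by its frequency. The variable names are illustrative.

```python
# Example 1: body lengths (in inches) of 10 full-term infants.
lengths = [17.5, 19.5, 17.5, 19, 20, 21, 18, 19.5, 18, 10.75]
mean_length = sum(lengths) / len(lengths)

# Example 2: monthly incomes (birr) with the number of students at each income.
incomes = [54.5, 64.5, 74.5, 84.5, 94.5, 104.5, 114.5]
students = [6, 9, 15, 25, 13, 7, 5]

# Mean of an ungrouped frequency distribution: sum of f * x over sum of f.
weighted_total = sum(f * x for x, f in zip(incomes, students))
mean_income = weighted_total / sum(students)

print(f"Example 1: mean length = {mean_length:.3f} inches")
print(f"Example 2: mean income = {mean_income:.3f} birr")
```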
Arithmetic Mean for Grouped Frequency Distribution
If data are given in the form of a continuous (grouped) frequency distribution, the
sample mean can be computed as
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i m_i}{\sum_{i=1}^{k} f_i}$$
where $m_i$ = the class mark of the ith class, i = 1, 2, …, k,
$f_i$ = the frequency of the ith class, and k = the number of classes.
Note that $\sum_{i=1}^{k} f_i = n$ = the total number of observations.

Example: The following table gives the daily wages of laborers. Calculate the average
daily wage paid to a laborer.

Wages in birr        11-13   13-15   15-17   17-19   19-21   21-23   23-25
Number of laborers   3       4       5       6       6       4       3
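
A minimal Python sketch of this calculation is given below, taking each class mark as the midpoint of the stated class interval. The variable names are illustrative.

```python
# Daily wage classes (birr) and the number of laborers in each class.
classes = [(11, 13), (13, 15), (15, 17), (17, 19), (19, 21), (21, 23), (23, 25)]
laborers = [3, 4, 5, 6, 6, 4, 3]

# Class mark = midpoint of each class.
class_marks = [(lower + upper) / 2 for lower, upper in classes]

# Grouped mean: sum of (frequency * class mark) divided by total frequency.
total_laborers = sum(laborers)
mean_wage = sum(f * m for f, m in zip(laborers, class_marks)) / total_laborers

print("class marks:", class_marks)
print(f"mean daily wage = {mean_wage:.2f} birr")
```

With these data the sum of f·m is 560 and the total frequency is 31, so the mean daily wage is about 18.06 birr.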
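Properties of the Arithmetic Mean
- The sum of the deviations of the items from their arithmetic mean is zero. That is,
the algebraic sum of the deviations of a set of numbers $x_1, x_2, \ldots, x_n$
from their mean $\bar{x}$ is zero:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

- The sum of the squares of the deviations of a set of observations from any
number, say A, is least only when $A = \bar{x}$. That is,
$\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - A)^2$.
- When a set of observations is divided into k groups and $\bar{x}_1$ is the mean of the $n_1$
observations of group 1, $\bar{x}_2$ is the mean of the $n_2$ observations of group 2, …, $\bar{x}_k$ is the
mean of the $n_k$ observations of group k, then the combined mean, denoted by $\bar{x}_c$, of
all observations taken together is given by
$$\bar{x}_c = \frac{n_1 \bar{x}_1 + n_2 \bar{x}_2 + \cdots + n_k \bar{x}_k}{n_1 + n_2 + \cdots + n_k}$$

- If a wrong figure has been used in calculating the mean, we can correct the mean if we
know the correct figure that should have been used. Let
- $x_{\text{wrong}}$ denote the wrong figure used in calculating the mean,
- $x_{\text{correct}}$ be the correct figure that should have been used,
- $\bar{x}_{\text{wrong}}$ be the wrong mean calculated using $x_{\text{wrong}}$; then the correct mean, $\bar{x}_{\text{correct}}$,
is given by
$$\bar{x}_{\text{correct}} = \bar{x}_{\text{wrong}} + \frac{x_{\text{correct}} - x_{\text{wrong}}}{n}$$

- If the mean of $x_1, x_2, \ldots, x_n$ is $\bar{x}$, then

a) the mean of $x_1 \pm k, x_2 \pm k, \ldots, x_n \pm k$ will be $\bar{x} \pm k$;
b) the mean of $k x_1, k x_2, \ldots, k x_n$ will be $k\bar{x}$.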
Example 1: Last year there were three sections taking the Stat 273 course at Alemaya
University. At the end of the semester, the three sections got average marks of 80, 83
and 76. There were 28, 32 and 35 students in the sections, respectively. Find the mean
mark for all the students.
Solution: The combined mean is
$$\frac{28(80) + 32(83) + 35(76)}{28 + 32 + 35} = \frac{7556}{95} \approx 79.54$$
Example 2: The average weight of 10 students was calculated to be 65 kg, but later it
was discovered that one measurement was misread as 40 kg instead of 80 kg.
Calculate the corrected average weight.

Solution: The wrong total is 10 × 65 = 650 kg, so the correct total is 650 - 40 + 80 = 690 kg
and the corrected average weight is 690/10 = 69 kg.
Exercise: The average score on the mid-term examination of 25 students was 75.8 out
of 100. After the mid-term exam, however, a student whose score was 41 out of 100
dropped the course. What is the average/mean score among the 24 students?
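
The combined-mean and corrected-mean calculations of Examples 1 and 2 can be sketched in Python as follows. The helper names `combined_mean` and `corrected_mean` are hypothetical, introduced only for this illustration; the exercise above is left to the reader.

```python
def combined_mean(sizes, means):
    """Mean of all observations, given each group's size and mean."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


def corrected_mean(wrong_mean, n, wrong_value, correct_value):
    """Correct a mean of n values after one value was recorded wrongly."""
    return wrong_mean + (correct_value - wrong_value) / n


# Example 1: three sections with 28, 32 and 35 students.
section_mean = combined_mean([28, 32, 35], [80, 83, 76])

# Example 2: 40 kg was used where 80 kg should have been.
weight_mean = corrected_mean(65, 10, 40, 80)

print(f"combined mean mark = {section_mean:.2f}")
print(f"corrected mean weight = {weight_mean:.1f} kg")
```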
Weighted Arithmetic Mean
In finding the arithmetic mean, all items were assumed to be of equal importance. When
proper importance is required to be given to different items, we find the weighted
average. Weights are assigned to each item in proportion to its relative importance.
If $x_1, x_2, \ldots, x_n$ represent the values of the items and $w_1, w_2, \ldots, w_n$ are the
corresponding weights, then the weighted mean, $\bar{x}_w$, is given by
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Example: A student’s final marks in Mathematics, Physics, Chemistry and Biology are
82, 80, 90 and 70, respectively. If the respective credits received for these courses are
3, 5, 3 and 1, determine the approximate average mark the student has got for one
course.
Solution: We use a weighted arithmetic mean, the weight associated with each course
being the number of credits received for that course.

Mark ($x_i$)     82   80   90   70
Credit ($w_i$)   3    5    3    1

Therefore
$$\bar{x}_w = \frac{3(82) + 5(80) + 3(90) + 1(70)}{3 + 5 + 3 + 1} = \frac{986}{12} \approx 82.17$$

The average mark of the student for one course is approximately 82.
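
As a quick check of this example, a weighted mean can be computed in Python as below. The function name `weighted_mean` is a hypothetical helper used only for illustration.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum of w * x divided by the sum of w."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


marks = [82, 80, 90, 70]      # Mathematics, Physics, Chemistry, Biology
credit_hours = [3, 5, 3, 1]   # credits used as weights

average_mark = weighted_mean(marks, credit_hours)
print(f"weighted average mark = {average_mark:.2f}")
```

If all weights are equal, this reduces to the ordinary arithmetic mean.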


Merits of Arithmetic Mean
• Arithmetic mean is rigidly defined by a mathematical formula so that its value is
always definite.
• It is calculated based on all observations.
• Arithmetic mean is simple to calculate and easy to understand. It doesn’t need
arraying (arranging in increasing or decreasing order) of the data.
• Arithmetic mean is also capable of further algebraic treatment.
• It affords a good standard of comparison.

Drawbacks of Arithmetic Mean

• It is highly affected by extreme (abnormal) observations in the series. For instance,
if the monthly incomes of three boys are 37 birr, 53 birr and 48 birr and that of their
father is 1026 birr, the average income of these four people is
(37 + 53 + 48 + 1026)/4 = 291 birr, which is not at all a representative figure.
• It can be a number which does not exist in the series.
• It sometimes gives results which appear almost absurd. For example, it is likely
that we can get an average of ‘3.6 children’ per family.
• It gives greater importance to bigger items of a series and lesser importance to
smaller items. That means it is an upward-biased measure.
• It can’t be calculated for open-ended classes.

3.3.2 The Median


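The median of a set of items (numbers) arranged in order of magnitude (i.e. in an array
form) is the middle value or the arithmetic mean of the two middle values. We shall
denote the median of $x_1, x_2, \ldots, x_n$ by $\tilde{x}$. For ungrouped data the median is obtained by

  $\tilde{x} = x_{\left(\frac{n+1}{2}\right)}$ if $n$ is odd, and
  $\tilde{x} = \dfrac{1}{2}\left[x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right]$ if $n$ is even,

where $x_{(i)}$ is the $i$th item in the ordered array.

For grouped data the median, obtained by the interpolation method, is given by

  $\tilde{x} = L + \dfrac{w}{f}\left(\dfrac{n}{2} - F\right)$

where $L$ = lower class boundary of the median class,
      $F$ = sum of the frequencies of all classes lower than the median class (in other words, it
            is the cumulative frequency preceding the median class),
      $f$ = frequency of the median class, $w$ = class width and $n$ = total number of observations.
The median class is the class with the smallest cumulative frequency greater than or equal
to $n/2$. It can be located by cumulating the frequencies beginning from the lowest class.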
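Example 1: The birth weights in pounds of five babies born in a hospital on a certain day
are 9.2, 6.4, 10.5, 8.1 and 7.8. Find the median weight of these five babies.
Solution: Arranged in order of magnitude, the weights are 6.4, 7.8, 8.1, 9.2 and 10.5. Since
the number of items (5) is odd, the median is the third (middle) value, that is, 8.1 pounds.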
Example 2: The following table gives the distribution of the weekly wages of employees
of a small firm.

  Wages in birr       No. of employees
  126 and below              3
  127 – 135                  5
  136 – 144                  9
  145 – 153                 12
  154 – 162                  5
  163 – 171                  4
  172 and above              2
a) Find the median weekly wage.
b) Why is the median a more suitable measure of central tendency than the
mean in this case?
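The interpolation method for the median of grouped data can be sketched in Python, using the weekly wage table of Example 2. This is a minimal sketch under stated assumptions: the function name grouped_median is illustrative, class boundaries are taken half a unit beyond the class limits (so 145 – 153 becomes 144.5 to 153.5), all closed classes share the width 9, and the open-ended first class is given a nominal lower boundary of 117.5 purely as a placeholder, which does not affect the result because the median does not fall in that class.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data by interpolation: L + (w / f) * (n/2 - F)."""
    n = sum(frequencies)
    if n <= 0:
        raise ValueError("total frequency must be positive")
    cumulative = 0  # F: cumulative frequency preceding the current class
    for lower, freq in zip(lower_boundaries, frequencies):
        # the median class is the first class whose cumulative frequency reaches n/2
        if freq > 0 and cumulative + freq >= n / 2:
            return lower + (width / freq) * (n / 2 - cumulative)
        cumulative += freq
    raise ValueError("median class not found")


# Weekly wage table; 117.5 is an assumed placeholder for the open-ended first class
lower_boundaries = [117.5, 126.5, 135.5, 144.5, 153.5, 162.5, 171.5]
frequencies = [3, 5, 9, 12, 5, 4, 2]

median_wage = grouped_median(lower_boundaries, frequencies, width=9)
print(median_wage)  # 146.75
```

Under these assumptions the median class is 145 – 153 and the median weekly wage comes out at 146.75 birr. Only the boundary of the median class is needed, so the open-ended first and last classes cause no difficulty; they have no class marks, however, so the mean cannot be computed from this table.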
Merits of median
• Median is a positional average and hence it is not influenced by extreme values.
• Median is rigidly defined, so that its value is always definite.
• Median can be calculated even in the case of open-ended intervals.
• It gives the best results in the study of phenomena which are incapable of direct
quantitative measurement, for example intelligence.
Demerits of median
• It is not capable of further algebraic treatment.
• It is not a good representative of the data if the number of items (data) is small.
• The arrangement of items in order of magnitude is sometimes a very tedious process if
the number of items is very large.
3.3.3 The Mode
The mode or the modal value is the most frequently occurring score/observation in a
series and is denoted by $\hat{x}$. Note that the mode may not exist in the series or, even if it does
exist, it may not be unique.

For grouped data, the mode is found by the following formula:

  $\hat{x} = L + \dfrac{\Delta_1}{\Delta_1 + \Delta_2} \times w$

where $L$ = lower class boundary of the modal class,
      $\Delta_1$ = the difference between the frequency of the modal class and that of the next lower
                   class,
      $\Delta_2$ = the difference between the frequency of the modal class and that of the next higher
                   class, and
      $w$ is the class width.
The modal class is the class with the highest frequency in the distribution.
Example 1: The marks obtained by ten students in a semester exam in statistics are: 70,
65, 68, 70, 75, 73, 80, 70, 83 and 86. Find the mode of the students’ marks.
Example 2: Find the mode for the frequency distribution of the birth weight (in kilograms)
of 30 children given below.

  Weight             1.9-2.3  2.3-2.7  2.7-3.1  3.1-3.5  3.5-3.9  3.9-4.3
  No. of children       5        5        9        4        4        3
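Both examples can be worked through with a short Python sketch. The function names modes and grouped_mode are illustrative only. For the grouped case the sketch takes the first class as 1.9-2.3, assuming a uniform class width of 0.4; this lower limit is an assumption and does not influence the result, since the modal class is 2.7-3.1.

```python
from collections import Counter


def modes(values):
    """Return every most frequent value; an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so no mode exists
    return [value for value, count in counts.items() if count == highest]


def grouped_mode(lower_boundaries, frequencies, width):
    """Mode of grouped data: L + d1 / (d1 + d2) * w."""
    i = frequencies.index(max(frequencies))  # first class with the highest frequency
    lower_f = frequencies[i - 1] if i > 0 else 0
    higher_f = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    d1 = frequencies[i] - lower_f    # modal class minus next lower class
    d2 = frequencies[i] - higher_f   # modal class minus next higher class
    if d1 + d2 == 0:
        raise ValueError("mode is not defined by this formula")
    return lower_boundaries[i] + d1 / (d1 + d2) * width


# Example 1: ungrouped marks
marks = [70, 65, 68, 70, 75, 73, 80, 70, 83, 86]
print(modes(marks))  # [70]

# Example 2: birth weights (kg); first boundary 1.9 is an assumption
lower_boundaries = [1.9, 2.3, 2.7, 3.1, 3.5, 3.9]
frequencies = [5, 5, 9, 4, 4, 3]
modal_weight = grouped_mode(lower_boundaries, frequencies, width=0.4)
print(round(modal_weight, 2))  # 2.88
```

The ungrouped function returns a list because the mode need not be unique and may not exist at all.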
Merits of mode
- Mode is not affected by extreme values.
- Mode can be calculated even in the case of open-ended intervals, and it is not necessary
to know all the observations.
Demerits of mode
- Mode may not exist in the series and if it exists it may not be a unique value.
- It does not fulfill most of the requirements of a good measure of central tendency.
- It may be unrepresentative in many cases.

3.3.4 Quantiles
Quantiles are values which divide a data set arranged in order of magnitude into
certain equal parts. They are averages of position (non-central tendency). Some of these
quantiles are quartiles, deciles and percentiles.
I. Quartiles: are values which divide the data set into four equal parts, denoted by
$Q_1$, $Q_2$ and $Q_3$. The first quartile is also called the lower quartile and the third quartile is the
upper quartile. The second quartile is the median.
• For ungrouped data:
Let $Q_j$ be the $j$th quartile value for $j = 1, 2, 3$. Then
  $Q_j$ = the value of the $\left(\dfrac{j(n+1)}{4}\right)$th item in the ordered array.

• For grouped data:
We can apply the following formula:

  $Q_j = L + \dfrac{w}{f}\left(\dfrac{jn}{4} - F\right)$

where $Q_j$ = the quartile which is to be worked out,
      $L$ = lower class boundary of the quartile class,
      $F$ = sum of the frequencies of all classes lower than the quartile class,
      $f$ = frequency of the quartile class, $w$ = class width and $n$ = total number of observations.
The quartile class is the class with the smallest cumulative frequency greater than or
equal to $jn/4$. It can be located by cumulating the frequencies beginning from the
lowest class.
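II. Deciles: are values dividing the data into ten equal parts, denoted by $D_1, D_2, \ldots, D_9$.
The fifth decile is the median.
• For ungrouped data:
Let $D_j$ be the $j$th decile value for $j = 1, 2, \ldots, 9$. Then
  $D_j$ = the value of the $\left(\dfrac{j(n+1)}{10}\right)$th item in the ordered array.

• For grouped data:
We can apply the following formula:

  $D_j = L + \dfrac{w}{f}\left(\dfrac{jn}{10} - F\right)$

where $L$, $F$, $f$ and $w$ are the lower class boundary of the decile class, the sum of the frequencies of
all classes lower than it, its frequency and the class width, as in the case of quartiles.
The decile class is the class with the smallest cumulative frequency greater than or
equal to $jn/10$. It can be located by cumulating the frequencies beginning from the
lowest class.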
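III. Percentiles: are values which divide the data into one hundred equal parts, denoted by
$P_1, P_2, \ldots, P_{99}$. The fiftieth percentile is the median.
• For ungrouped data: Let $P_j$ be the $j$th percentile value for $j = 1, 2, \ldots, 99$. Then
  $P_j$ = the value of the $\left(\dfrac{j(n+1)}{100}\right)$th item in the ordered array.

• For grouped data: We can use the following formula:

  $P_j = L + \dfrac{w}{f}\left(\dfrac{jn}{100} - F\right)$

where $L$, $F$, $f$ and $w$ are the lower class boundary of the percentile class, the sum of the frequencies
of all classes lower than it, its frequency and the class width, as in the case of quartiles.
The percentile class is the class with the smallest cumulative frequency greater than or
equal to $jn/100$. It can be located by cumulating the frequencies beginning from the
lowest class.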
Interpretations
1. $Q_j$ is the value below which $(j \times 25)$ percent of the observations in the series are
found (where $j = 1, 2, 3$). For instance, $Q_3$ means the value below which 75 percent of
observations in the given series are found.
2. $D_j$ is the value below which $(j \times 10)$ percent of the observations in the series are
found (where $j = 1, 2, \ldots, 9$). For instance, $D_4$ is the value below which 40 percent of the
values are found in the series.
3. $P_j$ is the value below which $j$ percent of the total observations are found (where
$j = 1, 2, \ldots, 99$). For example, 73 percent of the observations in a given series are
below $P_{73}$.
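Quartiles, deciles and percentiles of grouped data all follow the same interpolation pattern, differing only in the number of equal parts (4, 10 or 100). The Python sketch below illustrates this with the weekly wage table used in the median example. It is a sketch under assumptions: the function name grouped_quantile and its parameters are illustrative, class boundaries are taken half a unit beyond the class limits, the class width is 9, and the lower boundary 117.5 of the open-ended first class is a placeholder that is never used in these calculations.

```python
def grouped_quantile(lower_boundaries, frequencies, width, j, parts):
    """j-th quantile of grouped data when the data are split into `parts` equal parts.

    parts = 4 gives quartiles, 10 gives deciles, 100 gives percentiles.
    Formula: L + (w / f) * (j * n / parts - F).
    """
    if not 0 < j < parts:
        raise ValueError("j must lie strictly between 0 and parts")
    n = sum(frequencies)
    target = j * n / parts
    cumulative = 0  # F: cumulative frequency preceding the current class
    for lower, freq in zip(lower_boundaries, frequencies):
        # quantile class: first class whose cumulative frequency reaches the target
        if freq > 0 and cumulative + freq >= target:
            return lower + (width / freq) * (target - cumulative)
        cumulative += freq
    raise ValueError("quantile class not found")


# Weekly wage table; 117.5 is an assumed placeholder for the open-ended first class
lower_boundaries = [117.5, 126.5, 135.5, 144.5, 153.5, 162.5, 171.5]
frequencies = [3, 5, 9, 12, 5, 4, 2]

quartiles = [round(grouped_quantile(lower_boundaries, frequencies, 9, j, 4), 2)
             for j in (1, 2, 3)]
print(quartiles)  # [137.5, 146.75, 155.3]

fourth_decile = grouped_quantile(lower_boundaries, frequencies, 9, 4, 10)
print(fourth_decile)  # 143.5
```

The second quartile agrees with the median of the same table, as expected, and the fourth decile is read as the wage below which 40 percent of the employees fall.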
Exercise: The following table presents the male population of a certain region in Ethiopia.
Find a) all quartiles
b) The and decile and
c) and percentiles

  Age group (in years)     No. of male population
  0 – 5                          2580
  5 – 10                         3737
  10 – 15                        4620
  15 – 20                        5200
  20 – 25                        7250
  25 – 30                         620
  30 – 35                         297
  35 – 40                         355

3.4 When to Use the Different Averages


The mean is appropriate if the data are quantitative and there are no extreme (abnormal)
observations. For data having extreme values (or for qualitative data measured on an
ordinal scale) it is better to use the median as the measure of central tendency. The median is
a widely used measure of central tendency in psychology, education and other social
sciences. On the other hand, the mode is the best measure of central tendency for qualitative data
with a nominal scale of measurement. It can also be used as a quick measure of central
tendency for both qualitative and quantitative data.

Chapter Four
Measures of Dispersion (Variation)
4.1 Objectives of Measuring Variation
Variation (dispersion) is the scatter or spread of observations (values) in a distribution.
The average or central value is of little use unless the degree of variation, which
occurs about it, is given. If the scatter about the measure of central tendency is very
large, the average is not a typical value. Therefore, it is necessary to develop a
quantitative measure of the dispersion (or variation) of the values about the average.

Measures of variation are statistical measures, which provide ways of measuring the
extent to which the data are dispersed or spread out.

Measures of variation are needed for the following basic objectives.

• To judge the reliability of a measure of central tendency
• To compare two or more sets of data with regard to their variability
• To control variability itself, as in quality control, body temperature, etc.
• To make further statistical analysis or to facilitate the use of other statistical
measures.

Properties of a good measure of dispersion


A good measure of dispersion should:
• be rigidly defined by a mathematical formula,
• be simple to understand and easy to calculate,
• be unique,
• be based on all observations in the series,
• not be affected by some extreme values existing in the series,
• have sampling stability property, and
• be capable of further algebraic treatment as well as further statistical analysis.

4.2 Absolute and Relative Measures of Dispersion
Measures of dispersion (variation) may be either absolute or relative. Absolute
measures of dispersion are expressed in the same unit of measurement in which the
original data are given. These values may be used to compare the variation in two
distributions provided that the variables are in the same units and of the same average
size.

If the two sets of data are expressed in different units, however, such as quintals
of sugar versus tonnes of sugarcane, or if the average sizes are very different, such as
managers’ salaries versus workers’ salaries, the absolute measures of dispersion are not
comparable. In such cases measures of relative dispersion should be used.

A measure of relative dispersion is the ratio of a measure of absolute dispersion to an
appropriate measure of central tendency. It is sometimes called a coefficient of
dispersion because the word “coefficient” represents a pure number (that is,
independent of any unit of measurement). Note also that the value of a relative
dispersion is a unitless quantity.

4.3 Types of Measures of Dispersion


4.3.1 The Range and Relative Range
Range (R) is defined as the difference between the largest and the smallest
observation in a given set of data. That is, $R = x_{\max} - x_{\min}$, where $x_{\max}$ and $x_{\min}$ are the
largest and the smallest observations in the series respectively.

In the case of grouped data, the range is found by taking the difference between the class mark
of the last class and that of the first class. That is, $R = x_k - x_1$, where $x_k$ and $x_1$
are the class marks of the last class and that of the first class respectively.

A relative range (RR), also known as the coefficient of range, is given by
  $RR = \dfrac{x_{\max} - x_{\min}}{x_{\max} + x_{\min}}$ for ungrouped data, and $RR = \dfrac{x_k - x_1}{x_k + x_1}$ for grouped data.
Properties of Range and Relative Range

• Range and relative range are easy to calculate and simple to understand.
• Both cannot be computed for grouped data with open-ended classes.
• They do not tell us anything about the distribution of values in the series.
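Example 1: Find the range and relative range for the monthly salary of ten workers in
a certain paint factory given below.

462 480 534 624 498 552 606 588 516 570

Solution: The largest salary is 624 and the smallest is 462, so
  $R = 624 - 462 = 162$ and $RR = \dfrac{624 - 462}{624 + 462} = \dfrac{162}{1086} \approx 0.149$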
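Example 2: Find the values of the range and relative range for the following
frequency distribution, which shows the distribution of the maximum loads supported
by a certain number of cables.

  Maximum load (in kilonewtons)    Number of cables
  93 – 97                                 2
  98 – 102                                5
  103 – 107                              12
  108 – 112                              17
  113 – 117                              14
  118 – 122                               6
  123 – 127                               3
  128 – 132                               1

Solution: The class marks of the first and the last classes are (93 + 97)/2 = 95 and
(128 + 132)/2 = 130, so
  $R = 130 - 95 = 35$ and $RR = \dfrac{130 - 95}{130 + 95} = \dfrac{35}{225} \approx 0.156$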
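The range and relative range of Examples 1 and 2 can be computed with the short Python sketch below. It takes the relative range as (largest − smallest) / (largest + smallest), the usual coefficient of range, and for grouped data uses the class marks of the first and last classes. The function name range_and_relative_range is an illustrative choice.

```python
def range_and_relative_range(smallest, largest):
    """Return (range, relative range) from the two extreme values or class marks."""
    r = largest - smallest
    return r, r / (largest + smallest)


# Example 1: monthly salaries of ten workers (ungrouped data)
salaries = [462, 480, 534, 624, 498, 552, 606, 588, 516, 570]
r1, rr1 = range_and_relative_range(min(salaries), max(salaries))
print(r1, round(rr1, 3))   # 162 0.149

# Example 2: maximum loads of cables (grouped data, class limits as given)
class_limits = [(93, 97), (98, 102), (103, 107), (108, 112),
                (113, 117), (118, 122), (123, 127), (128, 132)]
class_marks = [(low + high) / 2 for low, high in class_limits]
r2, rr2 = range_and_relative_range(class_marks[0], class_marks[-1])
print(r2, round(rr2, 3))   # 35.0 0.156
```

Note that the frequencies play no part in either measure, which is why the range says nothing about how the values are distributed between the extremes.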

4.3.2 The Mean Deviation and Coefficient of Mean Deviation


The mean deviation (MD) measures the average deviation of a set of observations
about their central value, generally the mean or the median, ignoring the plus/minus
sign of the deviations.

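The mean deviation of a sample of n observations $x_1, x_2, \ldots, x_n$ is given as

  $MD = \dfrac{\sum_{i=1}^{n} |x_i - A|}{n}$

where A is a central measure (the mean or the median).

In the case of grouped data, the formula for MD becomes

  $MD = \dfrac{\sum_{i=1}^{k} f_i|x_i - A|}{n}$

where $x_i$ is the class mark of the $i$th class, $f_i$ is the frequency of the $i$th class and $n = \sum_{i=1}^{k} f_i$.

• The mean deviation about the arithmetic mean $\bar{x}$ is, therefore, given by

  $MD(\bar{x}) = \dfrac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$ for ungrouped data, and

  $MD(\bar{x}) = \dfrac{\sum_{i=1}^{k} f_i|x_i - \bar{x}|}{n}$ for a grouped frequency distribution, where $x_i$ is the
  class mark of the $i$th class, $f_i$ is the frequency of the $i$th class and $n = \sum_{i=1}^{k} f_i$.

• The mean deviation about the median $\tilde{x}$ is also given by

  $MD(\tilde{x}) = \dfrac{\sum_{i=1}^{n} |x_i - \tilde{x}|}{n}$ for ungrouped data, and

  $MD(\tilde{x}) = \dfrac{\sum_{i=1}^{k} f_i|x_i - \tilde{x}|}{n}$ for a grouped frequency distribution, where $x_i$ is the
  class mark of the $i$th class, $f_i$ is the frequency of the $i$th class and $n = \sum_{i=1}^{k} f_i$.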

The coefficient of mean deviation (CMD) is the ratio of the mean deviation of the
observations to their appropriate measure of central tendency: the arithmetic mean or
the median.
In general, $CMD = \dfrac{MD}{A}$, where A is a measure of central tendency: the arithmetic mean
or the median.
That is, CMD about the arithmetic mean $\bar{x}$ is given by $CMD = \dfrac{MD}{\bar{x}}$, where MD is the
mean deviation calculated about the arithmetic mean. On the other hand, CMD about
the median $\tilde{x}$ is given by $CMD = \dfrac{MD}{\tilde{x}}$, in which case MD is calculated about the median
of the observations.
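As an illustration, the mean deviation and its coefficient can be computed for the ten monthly salaries used in the range example (462, 480, ..., 570). The Python sketch below is a minimal one; the function name mean_deviation is illustrative, and the standard library's statistics module is used only to obtain the mean and the median.

```python
from statistics import mean, median


def mean_deviation(values, center):
    """Average absolute deviation of the values about a chosen central value."""
    return sum(abs(x - center) for x in values) / len(values)


salaries = [462, 480, 534, 624, 498, 552, 606, 588, 516, 570]

md_mean = mean_deviation(salaries, mean(salaries))       # about the arithmetic mean
md_median = mean_deviation(salaries, median(salaries))   # about the median
cmd_mean = md_mean / mean(salaries)                      # coefficient of mean deviation

print(md_mean, md_median)   # 45.0 45.0
print(round(cmd_mean, 4))   # 0.0829
```

For this data set the mean and the median coincide (both are 543), so the two mean deviations are equal; in general they differ.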

Properties of Mean Deviation and coefficient of mean deviation


- It is easy to understand and compute
- It is based on all observations
- It is not greatly affected by extreme values.
- It is not capable of further mathematical treatment and it is not a very
accurate measure of dispersion.

4.3.3 The Variance, the Standard Deviation and Coefficient of Variation


The Variance
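Variance is the arithmetic mean of the squares of the deviations of observations from
their arithmetic mean.
• Population Variance ($\sigma^2$)
For ungrouped data

  $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

where $\mu$ is the population arithmetic mean and N is the total number of observations
in the population.

For grouped data

  $\sigma^2 = \dfrac{\sum_{i=1}^{k} f_i(x_i - \mu)^2}{N}$

where $\mu$ is the population arithmetic mean, $x_i$ is the class mark of the $i$th class, $f_i$ is the
frequency of the $i$th class and $N = \sum_{i=1}^{k} f_i$.

• Sample Variance ($S^2$)
For ungrouped data

  $S^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

where $\bar{x}$ is the sample arithmetic mean and n is the total number of observations
in the sample.

For grouped data

  $S^2 = \dfrac{\sum_{i=1}^{k} f_i(x_i - \bar{x})^2}{n - 1}$

where $\bar{x}$ is the sample arithmetic mean, $x_i$ is the class mark of the $i$th class, $f_i$ is the
frequency of the $i$th class and $n = \sum_{i=1}^{k} f_i$.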
The Standard Deviation
Standard deviation is the positive square root of the variance.
• Population Standard Deviation ($\sigma$)
  $\sigma = \sqrt{\sigma^2}$, where $\sigma^2$ is the population variance.

• Sample Standard Deviation ($S$)
  $S = \sqrt{S^2}$, where $S^2$ is the sample variance.
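The variance and standard deviation can be illustrated with the ten monthly salaries from the range example. The Python sketch below assumes the population variance divides the sum of squared deviations by N and the sample variance divides it by n − 1; the function names are illustrative. The standard library functions statistics.pvariance and statistics.variance follow the same two conventions and are used as a cross-check.

```python
import math
import statistics


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


salaries = [462, 480, 534, 624, 498, 552, 606, 588, 516, 570]

pop_var = population_variance(salaries)
samp_var = sample_variance(salaries)
print(pop_var, samp_var)                  # 2673.0 2970.0
print(round(math.sqrt(pop_var), 2))       # 51.7
print(round(math.sqrt(samp_var), 2))      # 54.5

# Cross-check against the standard library
print(statistics.pvariance(salaries) == pop_var)   # True
print(statistics.variance(salaries) == samp_var)   # True
```

The sample variance is always the larger of the two for the same data, because the same sum of squares is divided by a smaller number.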

Coefficient of Variation
The standard deviation is an absolute measure of dispersion. The corresponding
relative measure is known as the coefficient of variation (CV).

Coefficient of variation is used in problems where we want to compare the
variability of two or more different series. Coefficient of variation is the
ratio of the standard deviation to the arithmetic mean, usually expressed in percent:
  $CV = \dfrac{S}{\bar{x}} \times 100\%$, where S is the standard deviation of the observations and $\bar{x}$ is their arithmetic mean.
A distribution having a smaller coefficient of variation is said to be less variable or more
consistent or more uniform or more homogeneous.

Example: Last semester, the students of Biology and Chemistry Departments took
Stat 273 course. At the end of the semester, the following information was recorded.

  Department             Biology    Chemistry
  Mean score                79          64
  Standard deviation        23          11

Compare the relative dispersions of the two departments’ scores using the appropriate
way.

Solution:
  Biology Department:    $CV = \dfrac{23}{79} \times 100\% \approx 29.11\%$
  Chemistry Department:  $CV = \dfrac{11}{64} \times 100\% \approx 17.19\%$
Interpretation: Since the CV of Biology Department students is greater than that of
Chemistry Department students, we can say that there is more dispersion relative to
the mean in the distribution of Biology students’ scores compared with that of
Chemistry students.
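The comparison in this example can be reproduced with a few lines of Python. This is a minimal sketch; the function name coefficient_of_variation is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the arithmetic mean."""
    return std_dev / mean * 100


cv_biology = coefficient_of_variation(23, 79)
cv_chemistry = coefficient_of_variation(11, 64)

print(round(cv_biology, 2))     # 29.11
print(round(cv_chemistry, 2))   # 17.19

# The department with the smaller CV has the more consistent scores
more_consistent = "Chemistry" if cv_chemistry < cv_biology else "Biology"
print(more_consistent)          # Chemistry
```

Because the CV is a pure number, the same function can compare series measured in different units.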

Properties of the Variance and the Standard Deviation
Variance
• It removes most of the demerits or drawbacks of the measures of dispersion
discussed so far.
• Its unit is the square of the unit of measurement of values. For example, if the
variable is measured in kg, the unit of variance is kg².
• It is calculated based on all the observations/data in the series.
• It gives more weight to extreme values and less to those which are near to the mean.

Standard Deviation
• It is considered to be the best measure of dispersion.
• [Demerit] If the values of two series have different units of measurement, then we
cannot compare their variability just by comparing the values of their respective
standard deviations.
• It is calculated based on all the observations/data in the series. Standard deviation is
capable of further algebraic treatment.
• Standard deviation is neither easy to calculate nor easy to understand.
• Similar to the variance, standard deviation gives more weight to extreme values and
less to those which are near to the mean.

The Standard Scores (Z-Scores)


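A standard score is a measure that describes the relative position of a single score in
the entire distribution of scores in terms of the mean and standard deviation. It also
gives us the number of standard deviations a particular observation lies above or below
the mean.
Population standard score: $Z = \dfrac{x - \mu}{\sigma}$, where $x$ is the value of the observation, and $\mu$ and $\sigma$ are
the mean and standard deviation of the population respectively.
Sample standard score: $Z = \dfrac{x - \bar{x}}{S}$, where $x$ is the value of the observation, and $\bar{x}$ and $S$ are the
mean and standard deviation of the sample respectively.

Interpretation: If $Z$ is positive, the observation lies above the mean; if $Z$ is zero, the
observation is equal to the mean; and if $Z$ is negative, the observation lies below the mean.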

Example: Two sections were given an exam in a course. The average score was 72
with standard deviation of 6 for section 1 and 85 with standard deviation of 5 for
section 2. Student A from section 1 scored 84 and student B from section 2 scored 90.
Who performed better relative to his/her group?
Solution: Section 1: $\bar{x}_1$ = 72, $S_1$ = 6 and score of student A from Section 1: $x_A$ = 84
Section 2: $\bar{x}_2$ = 85, $S_2$ = 5 and score of student B from Section 2: $x_B$ = 90

Z-score of student A: $Z_A = \dfrac{x_A - \bar{x}_1}{S_1} = \dfrac{84 - 72}{6} = 2$

Z-score of student B: $Z_B = \dfrac{x_B - \bar{x}_2}{S_2} = \dfrac{90 - 85}{5} = 1$

From these two standard scores, we can conclude that student A has performed better
relative to his/her section because his/her score is two standard deviations
above the mean score of section 1, while the score of student B is only one standard
deviation above the mean score of section 2.
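The two standard scores can be verified with a short Python sketch. The function name z_score is an illustrative choice.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations by which a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


z_a = z_score(84, 72, 6)   # student A, section 1
z_b = z_score(90, 85, 5)   # student B, section 2

print(z_a, z_b)            # 2.0 1.0
print("A" if z_a > z_b else "B")   # A performed better relative to the group
```

Although student B has the higher raw score (90 against 84), student A has the higher standard score, which is what the comparison relative to each group rests on.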
