Stat Chapter 3
UNIT 3
Measures of Central Tendency: When two or more data sets are to be compared, the data
usually need to be condensed. Condensing a data set into a frequency distribution and
presenting it visually is not enough for comparison; it is also necessary to summarize
the data set in a single value. Such a value usually lies somewhere in the center of the data and
represents the entire data set, and hence it is called a measure of central tendency or an average.
A measure of central tendency is a single value that describes the characteristics of the
entire mass of data. It gives information about the location of the center of the distribution of
data values.
Central tendency refers to the measures used to determine the center of a distribution of data.
Such a measure identifies a single value that best represents an entire data set. The most common
measures of central tendency are the mean, median, and mode. Each of these measures calculates the
location of the central point using a different method. The choice of measures of central tendency
depends on the types of statistical data used.
The objectives of a measure of central tendency are:
To determine a single value around which other values in the data concentrate
To facilitate comparison among sets of data
To summarize or reduce the size of data
1. Mean (Average)
Averages are statistical constants that enable us to comprehend in a single value the significance
of the whole. An average gives us an idea about the concentration of the values in the central part
of the distribution. Simply speaking, an average of a statistical series is the value of the variable
which is representative of the entire distribution; it is the central value of the given statistical
data. The main objectives of the average are the following:
The main object (purpose) of the average is to give a bird’s eye view (summary) of the
statistical data.
The average removes all the unnecessary details of the data and gives a concise (to the
point or short) picture of the huge data under investigation.
Average is also of great use for the purpose of comparison (i.e., the comparison of two or
more groups in which the units of the variables are the same) and for the further analysis
of the data.
Averages are very useful for computing various other statistical measures such as
dispersion, skewness, and kurtosis.
Prerequisites (desirable qualities) of a Good Average: An average will be considered good if:
It utilizes all the values given in the data
It is not much affected by the extreme values
It can be calculated in almost all cases
It can be used in further statistical analysis of the data
It should avoid giving misleading results
Rigidly defined (unique)
Based on all observations under investigation
Easily understood and simple to compute
Suitable for further mathematical treatment and it should be mathematically
defined
Little affected by fluctuations of sampling and not highly affected by extreme
values.
i. Arithmetic Mean
Arithmetic mean: is defined as the sum of the measurements of the items divided by the total
number of items.
Arithmetic Mean: Arithmetic mean is a number that is obtained by adding the values of all the
items of a series and dividing the total by the number of items.
It is usually denoted by $\bar{x}$. For n observations $x_1, x_2, \dots, x_n$, $\bar{x} = \frac{\sum x_i}{n}$.
Example: There are six classrooms in Future Generation Hope Kindergarten School of Ambo
town. The class sizes of each of these kindergartens are 26, 20, 25, 18, 20, and 23. A researcher
writing a report about schools in her town wants to come up with a figure to describe the typical
kindergarten class size in this town.
$\bar{x} = \frac{\sum x_i}{n} = \frac{26 + 20 + 25 + 18 + 20 + 23}{6} = \frac{132}{6} = 22$
Therefore, the average kindergarten class size in this school is 22.
Example: Calculate the arithmetic mean of the pulse rates (beats per minute) of eleven students:
60 60 71 68 71 72 71 76 72 80 80
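$\bar{x} = \frac{\sum x_i}{n} = \frac{60 + 60 + 71 + 68 + 71 + 72 + 71 + 76 + 72 + 80 + 80}{11} = \frac{781}{11} = 71$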
In this case, there are two 60’s, one 68, three 71’s, two 72’s, one 76, and two 80’s. The number
of times each number occurs is called its frequency and the frequency is usually denoted by f.
The information in the sentence above can be written in a table, as follows.
Value, xi 60 68 71 72 76 80 Total
Frequency, fi 2 1 3 2 1 2 11
The formula for the arithmetic mean for data of this type is
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{1}{n}\sum f_i x_i$, where $n = \sum f_i$
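The two forms of the arithmetic mean can be checked against each other with a short Python sketch. It uses the pulse-rate data from the example above; the variable names are illustrative only.

```python
from collections import Counter

# Pulse rates (beats per minute) of the eleven students in the example
pulse_rates = [60, 60, 71, 68, 71, 72, 71, 76, 72, 80, 80]

# Raw-data form: sum of the observations divided by their number
mean_raw = sum(pulse_rates) / len(pulse_rates)

# Frequency-table form: sum of f_i * x_i divided by the sum of f_i
freq_table = Counter(pulse_rates)  # maps each value x_i to its frequency f_i
n = sum(freq_table.values())
mean_freq = sum(x * f for x, f in freq_table.items()) / n

print(mean_raw, mean_freq)
```

Both forms give the same value because the frequency table is only a compact way of writing the same eleven observations.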
The following frequency table gives the height (in inches) of 100 students in a college.
Example: In 2002/03, the average salaries of elementary school teachers in three cities were Birr
24,000, 20,000, and 30,000. If there were 600, 400, and 800 elementary school teachers in these
cities respectively, find the weighted average salary of all the elementary school teachers in the three cities.
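As a hedged sketch of how this example can be worked, the snippet below assumes the usual weighted-mean definition, the sum of w_i x_i divided by the sum of w_i, with the number of teachers in each city as the weights. The function name `weighted_mean` is a hypothetical helper, not something defined in these notes.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

salaries = [24_000, 20_000, 30_000]   # average salary in each city (Birr)
teachers = [600, 400, 800]            # number of teachers in each city (weights)

average_salary = weighted_mean(salaries, teachers)
print(round(average_salary, 2))
```

The result lies closer to 30,000 than the simple average of the three salaries would, because the city with the highest salary also has the most teachers.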
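Example: A student's final marks in Mathematics, Physics, Chemistry, and Biology are
A, B, D, and C respectively. If the respective credits for these courses are 4, 4, 3, and 2,
determine the approximate average mark the student has obtained for the courses.
Combined mean: If k groups of sizes $n_1, n_2, \dots, n_k$ have means $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k$,
the mean of all the observations taken together is
$\bar{x}_c = \frac{n_1\bar{x}_1 + n_2\bar{x}_2 + \dots + n_k\bar{x}_k}{n_1 + n_2 + \dots + n_k}$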
This is a special case of the weighted mean. In this case, the sample sizes are the weights.
Example: In the previous year there were two sections taking a Statistics course. At the end of the
semester, the two sections got average marks of 70 and 78. There were 45 and 50 students in the two
sections respectively. Find the mean mark for all the students.
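A worked solution can be sketched as follows, assuming the combined mean is computed with the section sizes as weights (the symbol $\bar{x}_c$ for the combined mean is a notational assumption):

```latex
\bar{x}_c = \frac{n_1\bar{x}_1 + n_2\bar{x}_2}{n_1 + n_2}
          = \frac{45(70) + 50(78)}{45 + 50}
          = \frac{3150 + 3900}{95}
          = \frac{7050}{95}
          \approx 74.2
```

The combined mean is slightly above the simple average of 70 and 78 because the section with the higher mean has more students.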
v. Harmonic Mean
It is a suitable measure of central tendency when the data pertain to speed, rate, and time. The
harmonic mean of n values is defined as n divided by the sum of their reciprocals.
Harmonic mean for individual series: If $x_1, x_2, \dots, x_n$ are n observations, then the harmonic
mean can be represented by the following formula:
$H.M. = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}$
Example: A car travels 25 miles at 25 mph, 25 miles at 50 mph, and 25 miles at 75 mph. Find
the harmonic mean of the three velocities.
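Solution
$H.M. = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}} = \frac{3}{\frac{1}{25} + \frac{1}{50} + \frac{1}{75}} = 40.9$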
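The car example can be verified with a small Python sketch. It computes the harmonic mean directly from the definition, compares it with `statistics.harmonic_mean` from the standard library, and cross-checks it against total distance divided by total time. The variable names are illustrative.

```python
from statistics import harmonic_mean

speeds_mph = [25, 50, 75]   # speed on each leg
leg_miles = 25              # every leg covers the same distance

# Definition: n divided by the sum of the reciprocals
n = len(speeds_mph)
hm_manual = n / sum(1 / v for v in speeds_mph)

# Standard-library equivalent
hm_library = harmonic_mean(speeds_mph)

# Cross-check: average speed = total distance / total time
total_distance = leg_miles * n
total_time = sum(leg_miles / v for v in speeds_mph)
average_speed = total_distance / total_time

print(round(hm_manual, 1), round(hm_library, 1), round(average_speed, 1))
```

The agreement with total distance over total time is why the harmonic mean, not the arithmetic mean, is the appropriate average speed when equal distances are covered at different speeds.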
Harmonic mean for discrete data arranged in FD: If the data are arranged in the form of a
frequency distribution, then
$H.M. = \frac{n}{\frac{f_1}{x_1} + \frac{f_2}{x_2} + \dots + \frac{f_m}{x_m}}$, where $n = \sum_{k=1}^{m} f_k$
Harmonic mean for continuous grouped FD: Whenever the frequency distribution is grouped and
continuous, the class marks of the class intervals are taken as $x_i$ and the above
formula can be used as
$H.M. = \frac{n}{\sum \frac{f_i}{x_i}}$, where $n = \sum f_i$
A.M is an appropriate average for all the situations where there are no extreme values in
the data
G.M is an appropriate average for calculating the average percent increase in sales,
population, production, etc. It is one of the best averages for the construction of index
numbers
H.M is an appropriate average for calculating the average rate of increase of profits of a
firm or finding the average speed of a journey or the average price at which articles are
sold
2. MEDIAN
The median is the midpoint of the values after they have been ordered from the smallest to the
largest. Equivalently, the median is a number that divides the data set into two equal parts: each
item in one part is no more than this number, and each item in the other part is no less than this
number.
The median is the value of that item in a series which divides the array into two equal parts, one
consisting of all the values less than it and the other consisting of all the values more than it.
The median is a positional average: the number of items below it is equal to the number of items
above it, so it occupies the central position. Thus, the median is defined as the middle value of
the variates when the values are arranged in ascending or descending order of magnitude. It is the
middle value if the number of variates is odd, and the average of the two middle values if the
number of variates is even.
Median is the middle number in a sorted list of numbers. It is the value that separates the higher
half from the lower half of a data sample. In a data set, it may be thought of as the “middle”
value. The median is an appropriate average in a highly skewed distribution, e.g. the distribution
of wages or incomes.
For example, in the data set [1, 2, 3, 6, 7, 8, 9], the median is 6, the fourth largest and also the
fourth smallest number in the sample. Therefore, when the data set has an odd number of
values, the median is the center value. When there is an even number of values in a data set,
the two middle values need to be added and divided by 2. For example, in the data set [1, 2, 3, 5,
6, 7, 8, 9], the median is (5 + 6)/2 = 5.5. To determine the median value in a data set, the numbers
must first be sorted in order of magnitude. The median is less affected by outliers and skewed
data than the mean. This property makes it a better measure of central tendency than the mean
when the data are skewed or contain outliers.
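The odd and even cases described above can be sketched in Python as follows. The helper `median_by_position` is a hypothetical name used for illustration; its result is compared with `statistics.median` from the standard library.

```python
from statistics import median

def median_by_position(data):
    """Median of raw data: middle value, or mean of the two middle values."""
    ordered = sorted(data)          # the data must be sorted first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:                  # odd number of values: the center value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: average the two middle values

odd_set = [1, 2, 3, 6, 7, 8, 9]
even_set = [1, 2, 3, 5, 6, 7, 8, 9]

print(median_by_position(odd_set), median_by_position(even_set))
print(median(odd_set), median(even_set))
```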
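In the case of a continuous frequency distribution, the class corresponding to the cumulative
frequency just greater than N/2 is called the median class, and the value of the median is obtained
by the following formula:
$\tilde{x} = L + \frac{w}{f}\left(\frac{N}{2} - CF\right)$
where L is the lower class boundary of the median class, w is the class width, f is the frequency
of the median class, and CF is the cumulative frequency of the class preceding the median class.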
Demerits of Median
It may not be representative value as it ignores extreme values
It can’t be determined precisely when its size falls between the two values
A slight change in the series may bring a drastic change in the median value
In case of an even number of observations or a continuous series, the median is an
estimated value rather than any value in the series
It is not suitable for further mathematical treatment except its use in mean deviation
It is not useful in cases where large weights are to be given to extreme values.
3. THE MODE
The mode is defined as the most frequently occurring value in the data. The mode is not affected
by the extreme values in the data. It is the only measure of central tendency that can be used
for qualitative (nominal) data. For example, when we speak of the opinion of an average person,
we are probably referring to the most frequently expressed opinion, which is the modal
opinion.
The mode in case of Ungrouped Data: "A value that occurs most frequently in a data set is called
the mode." If two or more values occur the same number of times, and more frequently than
the other values, then there is more than one mode. Data having one mode are called a
uni-modal distribution, data having two modes a bimodal distribution, and data having more
than two modes a multi-modal distribution.
The mode in case of Discrete Grouped Data: "A value which has the largest frequency in a set
of data is called the mode."
Mode in case of Continuous Grouped Data: In the case of continuous grouped data, the mode
lies in the class that carries the highest frequency. This class is called the modal class. The
formula used to compute the value of the mode is given below, together with numerical
examples of the mode for ungrouped and grouped data.
Example: Find the mode for the following exam result (10%) of 15 students
3,8,6,5,8,7,8,6,7,4,7,5,7,9,
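$\text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) w$
where L is the lower class boundary of the modal class, w is the class width,
$\Delta_1$ is the difference between the frequency of the modal class and that of the preceding class, and
$\Delta_2$ is the difference between the frequency of the modal class and that of the following class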
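A minimal Python sketch of both cases follows. For ungrouped data it uses `statistics.multimode` on the pulse-rate data from the arithmetic mean example. For continuous grouped data it assumes the usual formula Mode = L + [Δ1 / (Δ1 + Δ2)] × w; the function `grouped_mode` and the class figures passed to it are hypothetical and chosen only to illustrate the arithmetic.

```python
from statistics import multimode

# Ungrouped data: pulse rates from the arithmetic mean example
pulse_rates = [60, 60, 71, 68, 71, 72, 71, 76, 72, 80, 80]
modes = multimode(pulse_rates)        # list of all most frequent values

# A data set with two equally frequent values is bimodal (hypothetical data)
bimodal_modes = multimode([1, 1, 2, 2, 3])

def grouped_mode(lower_boundary, f_prev, f_modal, f_next, width):
    """Mode of continuous grouped data from the modal class and its neighbours."""
    delta1 = f_modal - f_prev         # modal class minus preceding class
    delta2 = f_modal - f_next         # modal class minus following class
    if delta1 + delta2 == 0:
        raise ValueError("modal class must exceed at least one neighbour")
    return lower_boundary + delta1 / (delta1 + delta2) * width

# Hypothetical modal class 20-30 with frequency 12, neighbours 5 and 8
example_mode = grouped_mode(lower_boundary=20, f_prev=5, f_modal=12, f_next=8, width=10)

print(modes, bimodal_modes, round(example_mode, 2))
```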
Example 3.19:
Demerits of mode
1. Mode is ill-defined. It is not always possible to find a clearly defined mode, in some
cases we may come across a distribution having two modes and it is called bi-modal. If a
distribution has more than two modes it is said to be multimodal
2. It is not based upon all the observations
3. It is not capable of further mathematical treatment
4. As compared with the mean, the mode is affected to a greater extent, by fluctuations of
sampling
i. QUARTILES
Quartiles are values that divide the data set into approximately four equal parts, denoted by
$Q_1$, $Q_2$, and $Q_3$. The first quartile ($Q_1$) is also called the lower quartile and the third
quartile ($Q_3$) is the upper quartile. The second quartile ($Q_2$) is the median.
Quartiles are values, which divide the ordered data into 4 equal parts. Hence there are three
quartiles
The first quartile Q1 is the value that is the first quarter of the given ordered data.
The second quartile Q2 is the value that divides the given ordered data into two equal
parts
The third quartile Q3 is the value that is the third quarter of the given ordered data
Quartiles are the measurements that divide the series into 4 equal parts. The median is the 2nd
quartile. The first quartile (Q1) is the value of the item which divides the lower half of the
distribution into two equal parts. The third quartile (Q3) is the value of the item that divides
the upper half of the distribution into two equal parts; that is, it is the value of the
$\frac{3(n+1)}{4}$th item in the series.
For raw (ungrouped) data, first arrange the n observations in increasing order of
magnitude. Then the ith quartile is given by
$Q_i = \left[\frac{i(n+1)}{4}\right]^{th}$ value of the ordered data, i = 1, 2, 3.
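Let $x_1, x_2, \dots, x_n$ be n ordered observations. The ith quartile is the value of the item
corresponding with the [i(n+1)/4]th position, i = 1, 2, 3.
That is, after arranging the data in ascending order, Q1, Q2, and Q3 are obtained by:
$Q_1 = \left(\frac{n+1}{4}\right)^{th}$ item, $Q_2 = \left(\frac{2(n+1)}{4}\right)^{th}$ item, and $Q_3 = \left(\frac{3(n+1)}{4}\right)^{th}$ item.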
For data arranged in a frequency distribution, we follow the same procedure as for the
median. That is, we construct the less-than cumulative frequency distribution and apply the formula
of quartiles for individual series.
For continuous data, use the following formula, where i = 1, 2, 3, and L, w, $f_{Q_i}$, and CF are
defined in the same way as for the median.
$Q_1 = L + \frac{w}{f_{Q_1}}\left(\frac{n}{4} - CF\right)$, $Q_2 = L + \frac{w}{f_{Q_2}}\left(\frac{2n}{4} - CF\right)$, $Q_3 = L + \frac{w}{f_{Q_3}}\left(\frac{3n}{4} - CF\right)$
The class in question is the one containing the (i×n/4)th value. That is, the class with the minimum
cumulative frequency greater than or equal to i×n/4 is the class of the ith quartile.
ii. DECILES
Deciles are values dividing the data approximately into ten equal parts, denoted by $D_1, D_2, \dots, D_9$.
Deciles for Individual series:
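Let $x_1, x_2, \dots, x_n$ be n ordered observations. The ith decile is the value of the item
corresponding with the [i(n+1)/10]th position, i = 1, 2, . . . , 9.
That is, after arranging the data in ascending order, D1, D2, . . . , and D9 are obtained by:
$D_1 = \left(\frac{n+1}{10}\right)^{th}$ item, $D_2 = \left(\frac{2(n+1)}{10}\right)^{th}$ item, . . . , and $D_9 = \left(\frac{9(n+1)}{10}\right)^{th}$ item.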
For data arranged in a frequency distribution, we follow the same procedure as for the
median. That is, we construct the less-than cumulative frequency distribution and apply the formula
of deciles for individual series.
Deciles for continuous data: Apply the following formula and follow the procedures of
quartiles for continuous data.
$D_i = L + \frac{w}{f_{D_i}}\left(\frac{in}{10} - CF\right)$, i = 1, 2, . . . , 9.
The symbols are defined in the same way as in the case of quartiles for continuous data.
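iii. PERCENTILES
Percentiles are values that divide the data approximately into one hundred equal parts, denoted by
$P_1, P_2, \dots, P_{99}$.
Let $x_1, x_2, \dots, x_n$ be n ordered observations. The ith percentile is the value of the item
corresponding with the [i(n+1)/100]th position, i = 1, 2, . . . , 99.
That is, after arranging the data in ascending order, P1, P2, . . . , and P99 are obtained by:
$P_1 = \left(\frac{n+1}{100}\right)^{th}$ item, $P_2 = \left(\frac{2(n+1)}{100}\right)^{th}$ item, . . . , and $P_{99} = \left(\frac{99(n+1)}{100}\right)^{th}$ item.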
For data arranged in a frequency distribution, we follow the same procedure as for the
median. That is, we construct the less-than cumulative frequency distribution and apply the formula
of percentiles for individual series.
Percentiles for continuous data:
$P_i = L + \frac{w}{f_{P_i}}\left(\frac{in}{100} - CF\right)$, i = 1, 2, . . . , 99.
The symbols are defined in the same way as in the case of quartiles or deciles for continuous
data.
Interpretations
$Q_i$ is the value below which (i × 25) percent of the observations in the series are found
(where i = 1, 2, 3). For instance, $Q_3$ means the value below which 75 percent of the
observations in the given series are found.
$D_i$ is the value below which (i × 10) percent of the observations in the series are found
(where i = 1, 2, . . . , 9). For instance, $D_4$ is the value below which 40 percent of the values
in the series are found.
$P_i$ is the value below which i percent of the total observations are found (where i = 1, 2,
3, . . . , 99). For example, 60 percent of the observations in a given series are below $P_{60}$.
Example: Calculate the median, $Q_1$, $Q_2$, $Q_3$, $D_4$, $D_9$, and $P_{40}$ for the following table.
x 10 11 12 13 14 15 16 17 18
f 2 8 25 48 65 40 20 9 2
Solution: The given data is measured and is arranged in increasing order. So we need to
construct only the cumulative frequency table before calculating the required values.
x 10 11 12 13 14 15 16 17 18
f 2 8 25 48 65 40 20 9 2
Cum. Freq. 2 10 35 83 148 188 208 217 219
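The total number of observations is 219, which is odd. Clearly then the median is 14 because
$\tilde{x} = \left(\frac{n+1}{2}\right)^{th}$ value = $\left(\frac{220}{2}\right)^{th}$ value = 110th value = 14
$Q_1 = \left(\frac{n+1}{4}\right)^{th}$ value = $\left(\frac{220}{4}\right)^{th}$ value = 55th value = 13
$Q_2 = \left(\frac{2(n+1)}{4}\right)^{th}$ value = $\left(\frac{440}{4}\right)^{th}$ value = 110th value = 14 = $\tilde{x}$
$Q_3 = \left(\frac{3(n+1)}{4}\right)^{th}$ value = $\left(\frac{660}{4}\right)^{th}$ value = 165th value = 15
$D_4 = \left(\frac{4(n+1)}{10}\right)^{th}$ value = $\left(\frac{880}{10}\right)^{th}$ value = 88th value = 14
$D_9 = \left(\frac{9(n+1)}{10}\right)^{th}$ value = $\left(\frac{1980}{10}\right)^{th}$ value = 198th value = 16
$P_{40} = \left(\frac{40(n+1)}{100}\right)^{th}$ value = $\left(\frac{8800}{100}\right)^{th}$ value = 88th value = 14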
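The look-up in the cumulative frequency table can be sketched in Python as below. This is an illustration under the [i(n+1)/k]th-position rule used in this section; the helper names `item_at` and `quantile` are hypothetical, and fractional positions are handled by linear interpolation between neighbouring items.

```python
from bisect import bisect_left
from itertools import accumulate

values = [10, 11, 12, 13, 14, 15, 16, 17, 18]
freqs = [2, 8, 25, 48, 65, 40, 20, 9, 2]

cum_freq = list(accumulate(freqs))   # less-than cumulative frequencies
n = cum_freq[-1]                     # total number of observations

def item_at(position):
    """Value of the position-th ordered observation (1-based whole number)."""
    # First class whose cumulative frequency reaches the position
    return values[bisect_left(cum_freq, position)]

def quantile(i, parts):
    """i-th quantile when the data are divided into `parts` equal parts."""
    position = i * (n + 1) / parts
    whole = int(position)
    fraction = position - whole
    lower = item_at(whole)
    if fraction == 0:
        return lower
    upper = item_at(whole + 1)
    return lower + fraction * (upper - lower)

median_value = quantile(1, 2)
q1, q2, q3 = (quantile(i, 4) for i in (1, 2, 3))
d4, d9 = quantile(4, 10), quantile(9, 10)
p40 = quantile(40, 100)

print(median_value, q1, q2, q3, d4, d9, p40)
```

The 88th observation serves both the fourth decile and the fortieth percentile, which is why the two measures coincide.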
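$Q_1 = L + \frac{w}{f_{Q_1}}\left(\frac{n}{4} - CF\right)$ = 10.5 + … = 19.1
$D_4$: the measure of the (4n/10)th value = 20th value, which lies in group 20.5 – 30.5.
$D_4 = L + \frac{w}{f_{D_4}}\left(\frac{4n}{10} - CF\right)$ = 20.5 + … = 29.1
$P_7$: the measure of the (7n/100)th value = 3.5th value, which lies in group 10.5 – 20.5.
$P_7 = L + \frac{w}{f_{P_7}}\left(\frac{7n}{100} - CF\right)$ = 10.5 + … = 11.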
Introduction
The term dispersion is generally used in two senses. Firstly, dispersion refers to the variation of
the items among themselves. If the values of all the items of a series are the same, there will be no
variation among the different items of the series. Secondly, dispersion refers to the variation of the
items around an average. If the difference between the values of the items and the average is large,
the dispersion will be high; on the other hand, if the difference between the values of the items
and the average is small, the dispersion will be low. Thus, dispersion is defined as the scatteredness
or spread of the individual items in a given series.
Similar to the mean deviation, the variance is also based on all observations in a set of data. But the
variance is the average of squared deviations from the mean. Recall that the sum of squared
deviations is minimum only when taken from the mean. Squared deviations are more amenable to
mathematical manipulation than absolute deviations. Thus, if we average the squared deviations
from the mean and take the square root of the result (to compensate for the fact that the deviations
were squared), we obtain the standard deviation. This overcomes the limitation of the mean deviation.
a. Population Variance ($\sigma^2$)
If we divide the squared variation by the number of values in the population, we get something
called the population variance. This variance is the "average squared deviation from the mean".
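For Ungrouped Data
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{1}{N}\left[\sum x_i^2 - N\mu^2\right]$,
where $\mu$ is the population arithmetic mean and N is the total number of observations in the population.
For Grouped Data
$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N} = \frac{1}{N}\left[\sum f_i x_i^2 - N\mu^2\right]$,
where $\mu$ is the population arithmetic mean, $x_i$ is the value or class mark of the ith class, $f_i$ is the
frequency of the ith class and $N = \sum f_i$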
b. Sample Variance ($S^2$)
To obtain the sample variance, replace the population mean with the sample mean in the formula
above. However, one of the major uses of a sample statistic is to estimate the corresponding
population parameter, and dividing by n gives an estimate that on average is not equal to the
parameter. To offset this, the sum of the squares of the deviations is divided by one less than the
sample size.
For Ungrouped Data
$S^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{1}{n-1}\left[\sum x_i^2 - n\bar{x}^2\right]$,
where $\bar{x}$ is the sample arithmetic mean and n is the total number of observations in the sample.
If the values xi have frequencies fi (i=1,2,…,m), then the sample variance is given by:
$S^2 = \frac{1}{n-1}\sum_{i=1}^{m} f_i (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum f_i x_i^2 - n\bar{x}^2\right]$,
where $\bar{x}$ is the sample arithmetic mean, $x_i$ is the value or
class mark of the ith class, $f_i$ is the frequency of the ith class and $n = \sum f_i$.
2. The Standard Deviation
There is a problem with variances. Recall that the deviations were squared. That means that the
units were also squared. To get the units back the same as the original data values, the square
root must be taken.
Population Standard Deviation ($\sigma$)
$\sigma = \sqrt{\sigma^2}$, where $\sigma^2$ is the population variance.
Sample Standard Deviation (S)
$S = \sqrt{S^2}$, where $S^2$ is the sample variance.
Example: Find the sample variance and standard deviation for the frequency distribution
of heights (in cm) of students at AU given below.
Heights in cms 150 152 154 156 158 160 162 164 166
Number of students 28 40 52 100 60 48 32 20 7
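Solution: $S^2 = \frac{1}{n-1}\left[\sum f_i x_i^2 - n\bar{x}^2\right]$, where $n = \sum f_i$ and $\bar{x} = \frac{\sum f_i x_i}{n}$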
Example: Calculate the sample variance and standard deviation of the blood glucose
level, in milligrams per deciliter, for 60 patients shown below.
Class limit  55 – 63  64 – 72  73 – 81  82 – 90  91 – 99  100 – 108  109 – 117
Frequency       9        5       12       17        7         6          4
Solution: In a continuous F.D., xi is the class mark representing the ith class.
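Class limit    xi     fi    fi xi    fi xi^2
55 – 63        59      9     531      31329
64 – 72        68      5     340      23120
73 – 81        77     12     924      71148
82 – 90        86     17    1462     125732
91 – 99        95      7     665      63175
100 – 108     104      6     624      64896
109 – 117     113      4     452      51076
Total                 60    4998     430476
$\bar{x} = \frac{\sum f_i x_i}{n} = \frac{4998}{60} = 83.3$
$S^2 = \frac{1}{n-1}\left[\sum f_i x_i^2 - n\bar{x}^2\right] = \frac{1}{59}\left[430476 - 60(83.3)^2\right] = 239.71$
$S = \sqrt{239.71} = 15.48$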
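The blood glucose computation can be sketched in Python as follows. One assumption is made and should be checked against the original table: the seventh class, whose frequency is 4, is taken to be 109 – 117 with class mark 113, continuing the pattern of the six classes listed. The sketch computes the sum of squares both from the definition and from the computational formula.

```python
from math import sqrt

# Class marks x_i; the last one (113) is an assumed continuation of the pattern
class_marks = [59, 68, 77, 86, 95, 104, 113]
freqs = [9, 5, 12, 17, 7, 6, 4]

n = sum(freqs)
mean = sum(f * x for x, f in zip(class_marks, freqs)) / n

# Definition: sum of f_i * (x_i - mean)^2
ss_definition = sum(f * (x - mean) ** 2 for x, f in zip(class_marks, freqs))

# Computational form: sum of f_i * x_i^2 minus n * mean^2
ss_shortcut = sum(f * x * x for x, f in zip(class_marks, freqs)) - n * mean ** 2

sample_variance = ss_definition / (n - 1)   # divide by n - 1 for a sample
sample_sd = sqrt(sample_variance)

print(n, round(mean, 1), round(sample_variance, 2), round(sample_sd, 2))
```

With this assumption the standard deviation agrees with the value of 15.48 stated in the solution, which supports the reading of the seventh class.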
3. Range
The range (R) is the difference between the largest value (L) and the smallest value (S) in a
distribution. Thus, Range (R) = L – S.
In the case of a continuous series, the range is the difference between the upper limit of the highest
class and the lower limit of the lowest class.
Range: Evaluation
The range is very simple to understand and easy to calculate. However, it is not based on all the
observations of the distribution and is unduly affected by the extreme values. Any change in the
data not related to minimum and maximum values will not affect the range. It cannot be calculated
for open-ended frequency distribution.
Example: The amount spent (in Birr) by a group of 10 students in the school canteen is as
follows: 110, 117, 129, 197, 190, 100, 100, 178, 255, 790.
Find the range and the co-efficient of the range.
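A minimal sketch of this example in Python is given below. The coefficient of range is not spelled out in these notes, so the sketch assumes the usual definition (L – S) / (L + S); the variable names are illustrative.

```python
# Amount spent (in Birr) by the 10 students
spending = [110, 117, 129, 197, 190, 100, 100, 178, 255, 790]

largest = max(spending)
smallest = min(spending)

data_range = largest - smallest                     # R = L - S
coefficient_of_range = data_range / (largest + smallest)   # assumed: (L - S) / (L + S)

print(data_range, round(coefficient_of_range, 3))
```

The single value 790 drives the result, which illustrates how strongly the range is affected by extreme values.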
Example 2: Find the range and its co-efficient from the following data
Size 10 – 20 20 – 30 30 – 40 40 – 50 50 – 100
Frequency 2 3 5 4 2
Solution: R = L – S = 100 – 10 = 90
b. Quartile Deviation
It is based on the lower quartile Q1 and the upper quartile Q3. The difference Q3 – Q1 is called
the inter-quartile range. The quartile deviation, also called the semi-inter-quartile range, is half
of the difference between the upper and lower quartiles, that is, half of the inter-quartile range.
Its formula is
$QD = \frac{Q_3 - Q_1}{2}$
Merits of QD
Demerits of QD
It is not based on all the items (it ignores 50% items, i.e., the first 25% and the last
25%).
It is greatly influenced by sampling fluctuations.
It is not amenable to algebraic manipulations.
Example: The wheat production (in kg) of 20 acres is given as: 1120, 1240, 1320, 1040, 1080,
1200, 1440, 1360, 1680, 1730, 1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, and 1885.
Based on these data, find the quartile deviation and the coefficient of quartile deviation.
Solution: Arrange the observations in ascending order: 1040, 1080, 1120, 1200,
1240, 1320, 1342, 1360, 1440, 1470, 1600, 1680, 1720, 1730, 1750, 1755, 1785, 1880, 1885,
1960.
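$Q_1 = \left(\frac{n+1}{4}\right)^{th}$ item = 5.25th item = 5th item + 0.25(6th item – 5th item) = 1240 + 0.25(1320 – 1240) = 1260
$Q_3 = \left(\frac{3(n+1)}{4}\right)^{th}$ item = 15.75th item = 15th item + 0.75(16th item – 15th item) = 1750 + 0.75(1755 – 1750) = 1753.75
$QD = \frac{Q_3 - Q_1}{2} = \frac{1753.75 - 1260}{2} = 246.875$
Coefficient of QD $= \frac{Q_3 - Q_1}{Q_3 + Q_1} = \frac{493.75}{3013.75} \approx 0.164$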
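The interpolation steps for the wheat data can be sketched in Python as follows. The helper `quartile` is a hypothetical name; it applies the [i(n+1)/4]th-position rule with linear interpolation between neighbouring items, and the coefficient of quartile deviation is assumed to be (Q3 – Q1) / (Q3 + Q1).

```python
production = [1120, 1240, 1320, 1040, 1080, 1200, 1440, 1360, 1680, 1730,
              1785, 1342, 1960, 1880, 1755, 1720, 1600, 1470, 1750, 1885]

def quartile(data, i):
    """i-th quartile of raw data using the i(n + 1)/4 position rule."""
    ordered = sorted(data)
    n = len(ordered)
    position = i * (n + 1) / 4
    whole = int(position)            # item just below the position
    fraction = position - whole      # how far to move towards the next item
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + fraction * (upper - lower)

q1 = quartile(production, 1)
q3 = quartile(production, 3)
quartile_deviation = (q3 - q1) / 2
coefficient_qd = (q3 - q1) / (q3 + q1)   # assumed definition of the coefficient

print(q1, q3, quartile_deviation, round(coefficient_qd, 3))
```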
Frequency 14 24 38 20 4