Chapter 5 - Descriptive Analysis
Averages
1. Computed averages
a) The arithmetic mean
b) The geometric mean
c) The harmonic mean
2. Positional averages
a) The median
b) The mode
Population and sample sizes
• If population data are involved, we let $N$ denote the number of observations, and the observations themselves are labeled $x_1, x_2, \ldots, x_N$.
• With sample data, $n$ is used to denote the number of observations, which are labeled $x_1, x_2, \ldots, x_n$.
The Arithmetic mean
• The arithmetic mean or simply the mean is the most commonly used measure of central tendency.
• In everyday use, however, this term is erroneously regarded as synonymous with the term average, but the mean is just one of
the many averages.
Ungrouped data
The mean for ungrouped data is computed by adding all the numerical observations and dividing the sum by the number of
observations in a given set.
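• Suppose there are $N$ population observations with numerical values denoted by $x_1, x_2, \ldots, x_N$. The population mean, which is customarily denoted by the Greek letter $\mu$ (read as "mu"), is given as
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
• For a sample of $n$ observations, the sample mean $\bar{x}$ is computed in the same way, with $n$ in place of $N$.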
The following are the lengths (in cm) of a sample of six garment blanks chosen at random from
a large batch of similar blanks: 54.5, 55.0, 55.7, 51.8, 54.2, 52.4.
What is the mean length of the sample of garments?
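The garment example can be checked with a short sketch. The function name `arithmetic_mean` is illustrative only; the data are the six lengths quoted above.

```python
# Lengths (cm) of the six garment blanks quoted in the example.
lengths = [54.5, 55.0, 55.7, 51.8, 54.2, 52.4]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


sample_mean = arithmetic_mean(lengths)
print(round(sample_mean, 2))  # 53.93
```

The total of the six lengths is 323.6 cm, so the sample mean is 323.6 / 6, or about 53.93 cm.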
Grouped data
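• Suppose that we have data grouped into $K$ classes, with frequencies $f_1, f_2, \ldots, f_K$. Let the midpoints (class marks) of these classes be $m_1, m_2, \ldots, m_K$, respectively.
• For a population of $N$ observations, the mean is estimated as
$$\mu \approx \frac{1}{N}\sum_{i=1}^{K} f_i m_i, \qquad N = \sum_{i=1}^{K} f_i$$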
Values Frequency
140-150 17
151-160 29
161-170 42
171-180 72
181-190 84
191-200 107
201-210 49
211-220 34
221-230 31
231-240 16
241-250 12
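A minimal sketch of the grouped-mean estimate applied to the table above. It assumes each class mark is the average of the stated class limits; the names `classes` and `grouped_mean` are illustrative.

```python
# (lower limit, upper limit, frequency) for each class in the table above.
classes = [
    (140, 150, 17), (151, 160, 29), (161, 170, 42), (171, 180, 72),
    (181, 190, 84), (191, 200, 107), (201, 210, 49), (211, 220, 34),
    (221, 230, 31), (231, 240, 16), (241, 250, 12),
]


def grouped_mean(rows):
    """Estimate the mean as sum(f * m) / sum(f), with m the class mark."""
    total_frequency = sum(f for _, _, f in rows)
    weighted_sum = sum(f * (low + high) / 2 for low, high, f in rows)
    return weighted_sum / total_frequency


print(sum(f for _, _, f in classes))  # 493 observations in total
print(grouped_mean(classes))
```

Because every observation in a class is replaced by the class mark, the result is an estimate of the mean rather than the exact mean of the raw data.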
The geometric mean
While the arithmetic mean is obtained by finding the sum of a set of $n$ values and then dividing by $n$, the geometric mean is the $n$th root of the product of a set of $n$ values. If we let $GM$ symbolize the geometric mean, then
$$GM = \sqrt[n]{x_1 \cdot x_2 \cdot x_3 \cdots x_n}$$
The geometric mean is used mainly for averaging series of ratios or percentages. It cannot be computed if any of the values in the data set is negative, and it is zero if any value is zero.
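Ungrouped data
It is much easier to work in logarithms when computing the geometric mean. The formula is
$$\log GM = \frac{1}{n}\sum_{i=1}^{n} \log x_i$$
From logarithms,
$$GM = \operatorname{antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log x_i\right)$$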
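The logarithm route can be sketched as follows. This version uses natural logarithms, which give the same result as base-10 logarithms, and the two ratios are hypothetical values chosen only for illustration.

```python
import math


def geometric_mean(values):
    """Geometric mean via logarithms: antilog of the mean of the logs."""
    if any(v <= 0 for v in values):
        raise ValueError("logarithms require strictly positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


ratios = [2.0, 8.0]  # hypothetical growth ratios, not from the slides
gm = geometric_mean(ratios)
print(round(gm, 6))  # 4.0, the square root of 2 * 8
```

Working with logs avoids multiplying many values together, which can overflow or underflow for long series.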
Geometric mean - Example
Solution
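The harmonic mean
The harmonic mean, $HM$, is the number of values divided by the sum of their reciprocals:
$$HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3} + \cdots + \frac{1}{x_n}}$$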
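A small sketch of the harmonic mean formula. The two speeds are hypothetical values used only to illustrate averaging a rate.

```python
def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    if any(v <= 0 for v in values):
        raise ValueError("this sketch expects strictly positive values")
    return len(values) / sum(1 / v for v in values)


# Hypothetical: equal distances travelled at 40 km/h and 60 km/h.
speeds = [40.0, 60.0]
hm = harmonic_mean(speeds)
print(round(hm, 6))  # 48.0
```

The result (48) is below the arithmetic mean (50) because more time is spent travelling at the slower speed.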
Ungrouped data
• Find the harmonic mean of
Revision question
Find the arithmetic mean, geometric mean, and harmonic mean of the following set of values of X
Exercises for grouped data
Compute the Arithmetic mean, Geometric mean, harmonic mean
Values Frequency
140-150 17
151-160 29
161-170 42
171-180 72
181-190 84
191-200 107
201-210 49
211-220 34
221-230 31
231-240 16
241-250 12
The median
The median conveys the notion of being the middlemost value, dividing the distribution into two halves: 50% of the values lie on each side of the median. The median is said to be a positional average since it is located rather than computed.
Ungrouped data
The median of a set of observations is defined as the middle value when the observations are arranged in order of magnitude, from the smallest to the largest or vice versa.
Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ be the ordered observations.
1. Arrange the data in ascending order.
2. Determine $n$, the number of elements in the set. Is $n$ odd or even?
3. Find the value of the median: if $n$ is odd, take the middle value; if $n$ is even, add the two middle values and divide the sum by two.
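The steps above can be sketched in a few lines. The function name `median` and the test values are illustrative, not taken from the slides.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)                  # number of elements
    middle = n // 2
    if n % 2 == 1:                    # odd n: a single middle value exists
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2  # even n


print(median([7, 3, 9]))      # 7
print(median([4, 1, 3, 2]))   # 2.5
```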
Example
Suppose we want to find the median expenditures of customers during the Christmas eve shopping at Target Supermarket, served by
the shop attendant at counter No. TS01. The following were her records (in Australian Dollars)
350
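For grouped data, the median is estimated as
$$\text{Median} = L + \left(\frac{\frac{n}{2} - CF}{f}\right) W$$
Where
$L$ is the lower-class boundary of the median class
$CF$ is the cumulative frequency of the class interval preceding the median class
$f$ is the frequency of the median class
$W$ is the class size/width of the median class
$n$ is the total number of observations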
Grouped data:
Grade Frequency
40-49 5
50-59 18
60-69 27
70-79 15
80-89 6
Solution
Grade Class boundaries Frequency CF
40-49 39.5-49.5 5 5
50-59 49.5-59.5 18 23
60-69 59.5-69.5 27 50
70-79 69.5-79.5 15 65
80-89 79.5-89.5 6 71
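Since $n/2 = 71/2 = 35.5$, the median is the 35.5th value.
The median class is 60-69, the first class whose cumulative frequency (50) reaches 35.5.
From the formula of the median, the lower-class boundary is $L = 59.5$, the preceding cumulative frequency is $CF = 23$, the median-class frequency is $f = 27$ and the class width is $W = 10$.
Then applying the formula, we get:
$$\text{Median} = 59.5 + \left(\frac{35.5 - 23}{27}\right) \times 10 \approx 64.13$$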
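A hedged sketch of the grouped-median interpolation for the grades table. The class boundaries are the ones listed in the solution table; the function name `grouped_median` is illustrative.

```python
# (lower boundary, upper boundary, frequency) from the grades table.
grades = [
    (39.5, 49.5, 5), (49.5, 59.5, 18), (59.5, 69.5, 27),
    (69.5, 79.5, 15), (79.5, 89.5, 6),
]


def grouped_median(rows):
    """Interpolate the median inside the class where the cumulative frequency reaches n/2."""
    n = sum(f for _, _, f in rows)
    half = n / 2
    cf_before = 0  # cumulative frequency of the classes already passed
    for lower, upper, f in rows:
        if cf_before + f >= half:
            return lower + (half - cf_before) / f * (upper - lower)
        cf_before += f
    raise ValueError("no observations")


median_grade = grouped_median(grades)
print(round(median_grade, 2))  # 64.13
```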
Exercises for grouped data
Values Frequency
140-150 17
151-160 29
161-170 42
171-180 72
181-190 84
191-200 107
201-210 49
211-220 34
221-230 31
231-240 16
241-250 12
The mode
• The word mode in French means “fashionable” and in the context of a frequency distribution, it means the
most common value.
• The mode therefore is defined as the value that occurs more frequently than any other in the data set.
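• A set of data may have a single mode, in which case it is said to be unimodal; two modes, in which case it is bimodal; or more than two modes, in which case it is said to be multimodal.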
Modes
Ungrouped data
Example
Find the mode for the following data sets;
Solution
a) Since is the value that occurs most often ( times) is the mode.
b) This set of values has two modes, . They both occur four times. It is an example of a bimodal
case.
c) This set of values has no mode because there is no one particular value that occurs more often
than any other
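The three cases in the solution (one mode, two modes, no mode) can be sketched as below. The example's data sets did not survive in the text, so the lists used here are hypothetical stand-ins.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # If every distinct value occurs equally often, no value stands out.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 3, 3, 5]))        # [3]     one mode
print(modes([1, 1, 2, 2, 3]))     # [1, 2]  bimodal
print(modes([1, 2, 3]))           # []      no mode
```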
Mode for Grouped data
In grouped data:
• First, determine the modal class, that is, the class with the highest frequency.
• Secondly, estimate the mode using the following formula:
$$\text{Mode} = L + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) W$$
• Where:
• $L$ = The lower-class boundary of the modal class
• $\Delta_1 = f_m - f_1$
• $\Delta_2 = f_m - f_2$
• $W$ = The size of the modal class
• $f_1$ = Frequency of the class preceding the modal class
• $f_2$ = Frequency of the class succeeding the modal class
• $f_m$ = Frequency of the modal class.
Example of the mode
Calculate the modal age for the age distribution of 228 patients below
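$\Delta_1$ = 57 - 50 = 7 (modal-class frequency minus the frequency of the preceding class)
$\Delta_2$ = 57 - 48 = 9 (modal-class frequency minus the frequency of the succeeding class)
$W$ = (34 - 30) + 1 = 5
• Mode = 29.5 + (7 / (7 + 9)) × 5 ≈ 31.7, taking 29.5 as the lower-class boundary of the modal class 30-34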
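A minimal sketch of the grouped-mode formula using the frequencies from this example. The lower boundary 29.5 is an assumption inferred from the class 30-34 and the stated answer, since the age table itself is missing; the function name is illustrative.

```python
def grouped_mode(lower_boundary, f_modal, f_before, f_after, width):
    """Mode = L + d1 / (d1 + d2) * W for the modal class."""
    d1 = f_modal - f_before   # excess over the preceding class
    d2 = f_modal - f_after    # excess over the succeeding class
    return lower_boundary + d1 / (d1 + d2) * width


# 29.5 is assumed to be the lower boundary of the modal class 30-34.
modal_age = grouped_mode(29.5, f_modal=57, f_before=50, f_after=48, width=5)
print(round(modal_age, 1))  # 31.7
```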
Exercises for grouped data
Compute the Mode
Values Frequency
140-150 17
151-160 29
161-170 42
171-180 72
181-190 84
191-200 107
201-210 49
211-220 34
221-230 31
231-240 16
241-250 12
Relationship between the mean, median and mode
For a symmetric frequency distribution which is bell shaped, the mean, median and mode coincide.
• But for skewed distribution with a peak and a long tail, these measures do not coincide.
• Starting from the peak of the distribution and moving towards the longer tail, they appear in the following
order : mode, median and mean.
Certain numerical relationships exist among the averages
For any series of positive values, except one whose observations are of identical value, the arithmetic mean is always greater than the geometric mean, which in turn is greater than the harmonic mean: $AM > GM > HM$. When all observations are identical, the three means are equal.
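The ordering of the three computed averages can be checked numerically. This sketch relies on Python's standard `statistics` module (`geometric_mean` requires Python 3.8 or later); the values are hypothetical.

```python
import statistics

values = [2.0, 4.0, 8.0]  # hypothetical positive observations

am = statistics.mean(values)
gm = statistics.geometric_mean(values)
hm = statistics.harmonic_mean(values)

print(am > gm > hm)  # True for positive values that are not all equal
```

For a series of identical values, such as `[5.0, 5.0, 5.0]`, the three functions return the same figure up to rounding.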
For a symmetrical and unimodal distribution
Of the three, the mean is the most useful measure of central tendency. It always exists, is unique and reliable, takes into account all observations and, lastly, is easy to manipulate.
Characteristics of measures of central tendency
The arithmetic mean
The arithmetic mean uses all observations in the data set, and it is therefore affected by extreme values
particularly if the extreme values fall to the same side of the distribution. When this occurs, the mean may be less
representative of the set than any of the other averages.
The arithmetic mean is unique, and it is always determined if the individual values of the variables are available.
As a computed average, the arithmetic mean lends itself to further algebraic manipulations.
The arithmetic mean is typical in the sense that it is the centre of gravity, balancing the values on either side of it.
Geometric mean
It is affected by all items in the series, but it gives less weight to extremely high values than does the arithmetic mean.
It is strictly determined for positive values but cannot be used to average negative values or values with a zero term.
It is adapted to average rates of change, ratios between measures and the ratios of price change.
It is also capable of algebraic manipulation.
Harmonic mean
It is also affected by all observations. However, since the reciprocals are averaged, it gives more weight to the smaller
values. This is just the opposite of the mean.
It is capable of algebraic manipulation.
It is adapted to average time rates and price movements. It is also useful when the observations are expressed inversely to
what is required in the average.
Median
It is a positional average that is affected by the number of items but not by the value of each item.
The extreme deviations from the central part of the distribution affect the median much less than the case for the
mean.
The median strictly speaking is indeterminate for an even number of cases, although by general agreement it is the
mean of the two central values of the data set.
The median unlike the mean does not lend itself to algebraic treatment.
It is meaningless for completely qualitative data but meaningful as long as the data can be ranked, such as grades A, B,
C, D and E. The median is the most suitable average to describe observations that are scored rather than computed
or measured.
Mode
The modal value is determined by the items at the point of greatest concentration and is not affected by the
remaining values of the data set.
The true mode is difficult to compute but readily located from the frequency distribution.
The mode does not lend itself to algebraic manipulation.
The mode is unaffected by extreme values and may not exist for a given data set.
The mode is meaningless unless the distribution includes a large number of observations and possesses a distinct
central tendency.
Measures of Dispersion
Exercises for grouped data
Compute the mean absolute deviation
Values Frequency
140-150 17
151-160 29
161-170 42
171-180 72
181-190 84
191-200 107
201-210 49
211-220 34
221-230 31
231-240 16
241-250 12
[Figure: histogram of continuous data, with frequency on the vertical axis]
• An average, a single value adopted to represent the central tendency of a series, is indeed a very useful and
powerful measure.
• However, the use of a single value to describe a distribution conceals many important facts. Decision making
often demands the revelation of these concealed characteristics of the distribution.
• For one thing, not all observations in a series are of the same value as the derived average. Almost without
exception, the items included in a distribution always depart from the central value, although the degree of
departure varies from one series to another.
• Thus, a measure of the dispersion or variation is needed in order to give a more complete description of the
chief characteristics of a distribution or to make possible effective comparison of two or more distributions.
• For example, a company manufacturing say electric bulbs will be interested not only in the average life of the
bulbs but also how consistent the performance of the bulbs is.
• That is, a mean life of 1000 hours will not be satisfactory, even if it is achieved, if in fact a very high
proportion of bulbs last only up to 300 hours.
Measures of Dispersion, Skewness and Kurtosis - Introduction
• A second consideration is that distribution shapes differ from one data set to another. Some are symmetrical,
others are not. Hence, to describe a distribution we also need a measure of symmetry or asymmetry, that is, a
description of the balance or lack of balance on both sides of the central tendency. The descriptive statistic for
this characteristic is called the measure of skewness.
• Finally, there are differences in the degree of peakedness among the different distributions. This
property is called kurtosis. To measure kurtosis is to define the pattern of scatter of observations among the
classes near the central value, as compared with the scatter of observations near both ends of the distribution.
• As in the central tendency, there are many different summary measures for dispersion, namely;
1. The Range
2. The Interquartile range
3. The Index of dispersion
4. The mean absolute deviation
5. The variance and standard deviation
6. The coefficient of variation
The range
The range is defined as the difference between the largest and the smallest values in the data set. Let $R$
stand for the range.
Then $R = x_{\max} - x_{\min}$, where $x_{\max}$ and $x_{\min}$ are the largest and smallest values of $X$ respectively.
In the grouped data case, the range is the difference between the upper-class boundary of the highest class
interval and the lower-class boundary of the lowest class interval.
The range is the simplest measure of dispersion and it is easy to determine.
Its chief disadvantage is that it uses only two values in the data set and therefore ignores the way the
remaining data values vary in the data set.
Example
Find the range of the following values
Now
• The range, though simple, may be used quite fruitfully as a measure of dispersion for many
purposes. It is perhaps most useful when one wants to know only the extent of the extreme
dispersion under ‘ordinary’ conditions. If either the largest or smallest item is unusual, the
range reveals nothing about the ordinary distribution of the items. The range can also be
used to advantage when the same sample size is used repeatedly, as is often true in
manufacturing quality control. In this case, comparisons between ranges are not affected by
differences in sample size, so it is easy to see whether dispersion is getting worse, staying the
same, or getting better. Finally, the range is the only measure of dispersion that people
without statistical training can immediately understand.
• The range however has some serious defects. It can be unduly influenced by one unusual
value in the sample. Also, the range is in no way a measure of the scatter of the intervening
items relative to the typical value. Finally, the range is highly sensitive to the size of a sample.
The range tends to increase, though not proportionately, as the size of the sample increases.
For this reason we cannot interpret the range properly without knowing the number of
observations included in the data.
Interquartile range
• The range as noted above is subject to the chance of erratic changes in the extreme items and it fails to
take into account the scatter within the range. To overcome these limitations, at least partially, one of
the measures suggested is called interquartile range.
• There are two sets of fractiles or percentiles that are used most frequently.
• The first is the 25th, 50th and 75th percentiles.
• These three fractiles divide the whole distribution into four equal parts or four quarters, thus they are
often referred to as quartiles and are denoted as $Q_1$, $Q_2$ and $Q_3$. Here $Q_1$ is the 25th percentile and is defined as that value of a
variable for which 25% of the values are less than or equal to it and 75% of the values of the variable are greater than
or equal to it. $Q_2$ and $Q_3$ are defined in a similar manner.
• The second set of frequently used fractiles is the 10th, 20th, ..., 90th percentiles.
• These nine fractiles divide the range of the variable into ten equal parts.
• Thus, they are referred to as deciles and are denoted as $D_1, D_2, \ldots, D_9$. The first decile, $D_1$, is equal to the 10th percentile and it is the value
of the variable such that 10% of the values are less than or equal to it and 90% of the values are greater than or equal to it.
The other deciles can be defined in a similar manner.
As mentioned earlier, the range is often affected by extreme values. One way of getting rid of this problem is to cut off the
top and bottom quarters by considering only the quartiles. We then get the range of what is left. This range is what we call
the interquartile range.
To determine the interquartile range, we begin by locating the two quartiles $Q_1$ and $Q_3$. The process of finding quartile values
parallels that of locating the median. To obtain quartiles from an array of a sample of $n$ items, we first locate the rank of the items by
observing that
$Q_1$ is the $\frac{n+1}{4}$th item and
$Q_3$ is the $\frac{3(n+1)}{4}$th item.
We then read off these values from the array. If fractional values occur, we make a linear interpolation between the values
corresponding to the two observations within which the fraction falls. The interquartile range, denoted by $IQR$, is simply the
difference between $Q_3$ and $Q_1$, that is
$$IQR = Q_3 - Q_1$$
A low interquartile range indicates a small variation among the central 50% of the items and a high interquartile range value
means that the variation among the central 50% of the items is large.
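A sketch of locating quartiles in an ordered array with linear interpolation. It assumes the rank convention $(n+1)/4$ and $3(n+1)/4$, which is one of several conventions in use; the function names and the test data are illustrative.

```python
def item_at_rank(ordered, rank):
    """Value at a 1-based, possibly fractional rank, using linear interpolation."""
    whole = int(rank)            # rank of the observation just below
    fraction = rank - whole      # how far to move towards the next observation
    if whole < 1:
        return ordered[0]
    if whole >= len(ordered):
        return ordered[-1]
    return ordered[whole - 1] + fraction * (ordered[whole] - ordered[whole - 1])


def interquartile_range(values):
    """Return (Q1, Q3, IQR) under the assumed (n + 1) / 4 rank convention."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = item_at_rank(ordered, (n + 1) / 4)
    q3 = item_at_rank(ordered, 3 * (n + 1) / 4)
    return q1, q3, q3 - q1


print(interquartile_range([1, 2, 3, 4, 5, 6, 7]))  # (2, 6, 4): ranks 2 and 6
print(interquartile_range([1, 2, 3, 4]))           # ranks 1.25 and 3.75
```

Statistical packages may use different rank conventions, so quartiles from software can differ slightly from hand calculations.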
Algebraic interpolations for quartile values from grouped data can be made by the same
principle or procedure as that for the median, since the median is identical with the second
quartile.
$$Q_1 = L_1 + \left(\frac{\frac{n}{4} - CF_1}{f_1}\right) W$$
$$Q_3 = L_3 + \left(\frac{\frac{3n}{4} - CF_3}{f_3}\right) W$$
Where $L_1$ and $L_3$ are the lower class boundaries of the first and third quartile classes, respectively; $n$ is the
total number of observations; $W$ is the width of the class containing the quartile of interest; $f_1$ and $f_3$
are the first and third quartile class frequencies respectively; and $CF_1$ and $CF_3$ are the cumulative frequencies
before the first quartile class and the third quartile class, respectively.
Example
Class Class mark Frequency CF
2-3 2.5 2 2
4-5 4.5 3 5
6-7 6.5 5 10
8-9 8.5 9 19
10-11 10.5 7 26
12-13 12.5 4 30
14-15 14.5 1 31
16-17 16.5 1 32
Total 32
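A sketch of the grouped quartile interpolation for the table above. It assumes class boundaries lie halfway between adjacent class limits (1.5-3.5, 3.5-5.5, and so on), giving a class width of 2; the names are illustrative.

```python
# (lower boundary, upper boundary, frequency), boundaries assumed halfway between limits.
table = [
    (1.5, 3.5, 2), (3.5, 5.5, 3), (5.5, 7.5, 5), (7.5, 9.5, 9),
    (9.5, 11.5, 7), (11.5, 13.5, 4), (13.5, 15.5, 1), (15.5, 17.5, 1),
]


def grouped_fractile(rows, proportion):
    """Interpolate the value below which the given proportion of observations falls."""
    n = sum(f for _, _, f in rows)
    target = proportion * n
    cf_before = 0  # cumulative frequency before the current class
    for lower, upper, f in rows:
        if cf_before + f >= target:
            return lower + (target - cf_before) / f * (upper - lower)
        cf_before += f
    raise ValueError("proportion must lie between 0 and 1")


q1 = grouped_fractile(table, 0.25)   # 5.5 + (8 - 5) / 5 * 2
q3 = grouped_fractile(table, 0.75)   # 9.5 + (24 - 19) / 7 * 2
print(round(q1, 2), round(q3, 2), round(q3 - q1, 2))  # 6.7 10.93 4.23
```

Calling the same function with a proportion of 0.5 gives the grouped median, which is the second quartile.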
The index of dispersion
The index of dispersion is a measure of variation for nominal or ordinal variables
suggested by Hammond K. R. and Householder J. E. (1962) in their text (Knopf, New York, pp. 136-142).
The index is simply a ratio of the number of different pairs that could be made out of the data
at hand, compared with the maximum number of unique pairings that could be created if cases were
evenly spread over all available categories. Denoted by $D$, the index of dispersion is expressed as
$$D = \frac{k\left(N^2 - \sum_{i=1}^{k} f_i^2\right)}{N^2 (k - 1)}$$
Where $N$ is the number of scores, $k$ is the number of categories of the variable into which data might be
classified, and $f_i$ is the frequency of the cases in the $i$th category. If all scores were in a single category of a
variable that has several possible categories, then there is a maximum concentration or minimum
variability, and $D$ would equal 0. On the other hand, if cases were evenly distributed among the
possible categories, there would be maximum variability, the numerator and denominator of the
ratio would be the same, and $D$ would equal 1. $D$ therefore varies from 0 to 1 and is a useful measure of
variation for nominal or ordinal variables.
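Example
Satisfaction with work
No. Category Frequency, f f²
1 Very satisfied 2 4
2 Satisfied 7 49
3 Neutral 3 9
4 Dissatisfied 9 81
5 Very dissatisfied 5 25
6 None of the above 4 16
Total 30 184
With $N = 30$ scores, $k = 6$ categories and $\sum f_i^2 = 184$, $D$ is computed as
$$D = \frac{6\,(30^2 - 184)}{30^2\,(6 - 1)} = \frac{4296}{4500} \approx 0.955$$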
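A short sketch of the index of dispersion applied to the satisfaction table, assuming the form $D = k(N^2 - \sum f_i^2) / (N^2(k-1))$; the function name is illustrative.

```python
def index_of_dispersion(frequencies):
    """D = k * (N^2 - sum of f^2) / (N^2 * (k - 1)), ranging from 0 to 1."""
    n = sum(frequencies)       # number of scores
    k = len(frequencies)       # number of categories
    sum_sq = sum(f * f for f in frequencies)
    return k * (n ** 2 - sum_sq) / (n ** 2 * (k - 1))


satisfaction = [2, 7, 3, 9, 5, 4]   # frequencies from the table above
d = index_of_dispersion(satisfaction)
print(round(d, 3))  # 0.955

print(index_of_dispersion([30, 0, 0, 0, 0, 0]))  # 0.0, all cases in one category
print(index_of_dispersion([5, 5, 5, 5, 5, 5]))   # 1.0, cases evenly spread
```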
A possible application of the index is as a measure of the concept ‘division of labour’ which refers to
the difference or variability among individuals in their sustenance activity. The more evenly spread
they are among the different possible occupations, the greater the division of labour. The amount of
division of labour could be expressed for different groups of people (for example, men and women in
Mean absolute deviation
The mean of the absolute deviations from the mean is called the mean absolute
deviation or simply average deviation and is obtained by dividing the sum of the
absolute deviations by the number of deviations, which is the same as the
number of observations. Thus, the mean absolute deviation from $n$ sample
observations is
$$MAD = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$
Where $\bar{x}$ is the mean of the variable and $|x_i - \bar{x}|$ signifies that all differences between the
individual values and the mean are in absolute terms, that is, only the magnitude
of the deviations and not the sign of the deviations is considered.
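A minimal sketch of the mean absolute deviation. The five values are hypothetical and are not the typing times from the example that follows, whose figures are missing from the text.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


minutes = [2, 4, 6, 8, 10]  # hypothetical data, mean = 6
mad = mean_absolute_deviation(minutes)
print(mad)  # 2.4, from absolute deviations 4, 2, 0, 2, 4
```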
Example
A secretary typed five one page letters. The number of minutes spent on these letters were
=
=
=
The mean absolute deviation is useful in situations where no elaborate analysis is required. It is
introduced here as a logical stepping stone to the variance, which is a superior measure of
dispersion.
Variance and standard deviation
Ungrouped data
The variance
The variance of a set of observations is defined as the mean of the squares of the deviations of the individual observations from their
mean. It tells us how the observations vary or are spread out about the mean.
Denoting the population variance of $N$ observations with mean $\mu$ by $\sigma^2$,
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$
The population standard deviation, $\sigma$, is the positive square root of the variance.
The sample variance, denoted by $s^2$, is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
An equivalent formula more convenient for computation is
$$s^2 = \frac{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}$$
The sample standard deviation, $s$, is the positive square root of the variance. Note that $n - 1$ is the divisor of $s^2$. The $n - 1$ accounts for the
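Example
Find the variance, standard deviation, coefficient of skewness and coefficient of kurtosis using
sample moments about the mean of 2, 15, 7, 29, 11, 33, 5, 15, 3, 23.
Solution
i x x²
1 2 4
2 15 225
3 7 49
4 29 841
5 11 121
6 33 1089
7 5 25
8 15 225
9 3 9
10 23 529
Total 143 3117
Variance = (10 × 3117 - 143²) / (10 × 9) = 10721 / 90 ≈ 119.12
Standard deviation = √119.12 ≈ 10.91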
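The variance in this example can be checked with a short sketch of the computational formula with divisor $n(n-1)$. The data are the ten values listed in the solution table; the function name is illustrative.

```python
import math

x = [2, 15, 7, 29, 11, 33, 5, 15, 3, 23]  # values from the solution table


def sample_variance(values):
    """Computational form: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return (n * total_sq - total ** 2) / (n * (n - 1))


variance = sample_variance(x)
std_dev = math.sqrt(variance)
print(sum(x), sum(v * v for v in x))          # 143 3117
print(round(variance, 2), round(std_dev, 2))  # 119.12 10.91
```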
Grouped data
For grouped data, the sample variance is
$$s^2 = \frac{n\sum_{i=1}^{K} f_i m_i^2 - \left(\sum_{i=1}^{K} f_i m_i\right)^2}{n(n - 1)}$$
Where $m_i$ are the midpoints or class marks; $K$ is the number of classes; $f_i$ is the frequency of class $i$; and $n = \sum f_i$.
The standard deviation is then given by the positive square root of the variance obtained from the formula above.
Interpreting variance or standard deviation
Large values of the standard deviation and variance indicate more dispersion of the values of $X$ about the mean, and small values
indicate that the values of $X$ are clustered about the mean.
Example
Compute the sample variance, standard deviation, coefficient of skewness and coefficient of
kurtosis using sample moments about the mean from the grouped data
Solution
Class interval Class mark, m Frequency, f fm m² fm²
0-4 2 6 12 4 24
5-9 7 25 175 49 1225
10-14 12 11 132 144 1584
15-19 17 7 119 289 2023
20-24 22 1 22 484 484
Total 50 460 5340
$$s^2 = \frac{50 \times 5340 - 460^2}{50 \times 49} = \frac{55400}{2450} \approx 22.61$$
and
$$s = \sqrt{22.61} \approx 4.76$$
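A sketch of the grouped variance calculation using the class marks and frequencies from the solution table; the function name `grouped_variance` is illustrative.

```python
import math

# (class mark m, frequency f) from the solution table.
rows = [(2, 6), (7, 25), (12, 11), (17, 7), (22, 1)]


def grouped_variance(marks_and_freqs):
    """(n * sum(f m^2) - (sum f m)^2) / (n * (n - 1)), with n = sum of f."""
    n = sum(f for _, f in marks_and_freqs)
    sum_fm = sum(f * m for m, f in marks_and_freqs)
    sum_fm2 = sum(f * m * m for m, f in marks_and_freqs)
    return (n * sum_fm2 - sum_fm ** 2) / (n * (n - 1))


s2 = grouped_variance(rows)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))  # 22.61 4.76
```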
Relative dispersion and coefficient of variation
So far we have been dealing with absolute measures of dispersion in a distribution. However, there are times when the problem is
to compare the scatter in one distribution with that of another. If the items in one distribution are different in magnitude as well as
units of measurement from the items in another distribution, it becomes difficult to compare the degree of scatter in the two. The
most common method of comparing amounts of different magnitudes and units is to reduce them to a comparable basis first.
Quite often, a standard deviation is expressed as a percentage of the mean of the data from which it was computed. This new value,
which is now unitless, is called the coefficient of variation and, denoting it by $CV$, we have
$$CV = \frac{s}{\bar{x}} \times 100\%$$
Example
Suppose that Mr. Otto, a Nakasero butcher, keeps records on monthly sales of his high quality beef and summarises his records as
follows;
From the table above, the coefficients of variation indicate that, on relative terms, there was greater variation in the prices of beef
Measures of skewness
• In a given data set, when the plotted frequency distribution curve is not symmetrical, that is, not bell-
shaped, we say it is skewed. The measure of skewness indicates the degree to which the data set
deviates from the symmetry and also indicates the direction in which it is skewed.
• Skewness is brought about by the presence of extreme values at one side or in one of the tails of the
distribution, thereby elongating that tail. When the extreme values occur in the upper or right tail, the
distribution is said to be positively skewed. When the extreme values occur in the lower or left tail,
the distribution is negatively skewed.
• As discussed in the previous chapter, the mean is the measure most affected by the presence of
extreme values in one tail of a distribution. It is pulled substantially in the direction of the extreme
values. The mode is unaffected by the extreme values while the median, which is affected by the
number but not the value of extreme values, is pulled in the direction of extreme values but not as far
as the mean. The median moves about two-thirds as far as the mean in the direction of extreme
values.
• Several measures of skewness are available but we shall look at only two;
1. Pearsonian measure of skewness
2. An alternative measure of skewness involves using sample moments about the mean
Pearsonian measure of skewness
• This measure is based on the relationship between the mean, the median and the mode. Denoting the Pearsonian
coefficient of skewness by $SK_p$, it is defined as
$$SK_p = \frac{\bar{x} - \text{Mode}}{s} \quad \text{or} \quad SK_p = \frac{3(\bar{x} - \text{Median})}{s}$$
• Theoretically, both should yield the same result because of the relationship $\bar{x} - \text{Mode} \approx 3(\bar{x} - \text{Median})$, which holds approximately for moderately skewed distributions.
• However, owing to the unstable measure of the mode, the formula involving the mean and median is usually preferred in
actual application.
• $SK_p$ is nearly always within the limits of $\pm 1$, even though theoretically it could range in value from $-3$ to $+3$.
• When the distribution is symmetrical, the mean, median and mode coincide so $SK_p = 0$.
• For a positively skewed distribution, the mean is greater than the median and $SK_p$ is a positive value. The inverse is true for
negatively skewed distributions.
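A hedged sketch of the Pearsonian coefficient in its mean-and-median form, $3(\bar{x} - \text{Median})/s$, using Python's `statistics` module. The ten values are reused from the ungrouped variance example purely as an illustration; the slides do not work this calculation.

```python
import statistics


def pearson_skewness(values):
    """3 * (mean - median) / sample standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    s = statistics.stdev(values)   # divisor n - 1
    return 3 * (mean - median) / s


x = [2, 15, 7, 29, 11, 33, 5, 15, 3, 23]
sk = pearson_skewness(x)
print(round(sk, 3))  # positive: the mean (14.3) exceeds the median (13)
```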
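Sample moments about the mean as measures of skewness
An alternative measure of skewness involves using sample moments about the mean. The general formula for, say, the $r$th moment
about the mean, which we denote by $M_r(x)$, is
$$M_r(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^r}{n}$$
In particular
1. The first moment about the mean (the mean deviation, with signs retained) is always zero, that is, $M_1(x) = 0$.
2. The second moment about the mean, $M_2(x)$, is the variance computed with divisor $n$.
3. The third moment about the mean, for ungrouped data, is
$$M_3(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^3}{n}$$
Note that the third moment about the mean can be positive or negative.
For computational purposes, we can use an equivalent formula for the numerator in the above equation given as
$$\sum_{i=1}^{n} (x_i - \bar{x})^3 = \sum_{i=1}^{n} x_i^3 - 3\bar{x}\sum_{i=1}^{n} x_i^2 + 2n\bar{x}^3$$
Measure of skewness
$$\alpha_3 = \frac{M_3(x)}{[M_2(x)]^{3/2}}$$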
Sample moments as measures of skewness
In case of grouped data the third moment about the mean is
$$M_3(x) = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^3}{\sum_{i=1}^{K} f_i}$$
Where $K$ is the number of classes, $f_i$ is the frequency, and $m_i$ is the midpoint or class mark.
Assuming a moderately skewed distribution, a measure of skewness which is independent of the scale is given by
$$\alpha_3 = \frac{M_3(x)}{[M_2(x)]^{3/2}}$$
where $M_2(x)$ is the second moment about the mean (the variance computed with divisor $n$).
In conclusion, a few observations may be mentioned. It is quite common to encounter positively skewed distributions in
economic and business data, particularly in production and price series, which can only be as small as zero but can be
indefinitely large. It is believed that positive skewness is produced by multiplicative forces. For instance, income distribution is
usually positively skewed because it is affected by a large number of factors such as education, sex, family background, and so
forth, which can be thought of as combining multiplicatively instead of additively. Negatively skewed distributions are quite
rare and it is often difficult to furnish a rational explanation for their existence.
Measures of Kurtosis
• The shape of the frequency distribution is primarily indicated by its skewness and kurtosis. The term kurtosis
describes the curvature or sharp-pointedness of the curve of the frequency distribution.
• Kurtosis measures from fractiles
• Using the fractiles, the coefficient of kurtosis, denoted by $Kur$, is defined as the ratio of the semi-interquartile range
(that is, half of the value of the interquartile range) to the interdecile range, namely
$$Kur = \frac{\frac{1}{2}(Q_3 - Q_1)}{D_9 - D_1}$$
Sample moments as kurtosis measure
Kurtosis is also measured using the fourth sample moment about the mean. For ungrouped data the fourth moment is
$$M_4(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{n}$$
The computational formula for the numerator in the case of ungrouped data is
$$\sum_{i=1}^{n} (x_i - \bar{x})^4 = \sum_{i=1}^{n} x_i^4 - 4\bar{x}\sum_{i=1}^{n} x_i^3 + 6\bar{x}^2\sum_{i=1}^{n} x_i^2 - 3n\bar{x}^4$$
For grouped data the fourth moment is
$$M_4(x) = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^4}{\sum_{i=1}^{K} f_i}$$
Where again, $K$ is the number of class intervals and $m_i$ are the class marks.
A measure of kurtosis, denoted by $\alpha_4$, which is independent of the scale is given by
$$\alpha_4 = \frac{M_4(x)}{[M_2(x)]^2}$$
where $M_2(x)$ is the second moment about the mean.
If $\alpha_4 = 3$, the distribution is mesokurtic. The bell-shaped normal curve is usually taken as the standard and has $\alpha_4 = 3$.
If $\alpha_4 < 3$, the distribution is platykurtic.
If $\alpha_4 > 3$, the distribution is leptokurtic.
Sometimes, we compute $\alpha_4 - 3$ and refer to it as a measure of excess kurtosis. For the normal curve, the value is zero; it is positive for leptokurtic and
negative for platykurtic distributions.
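The moment-based measures can be sketched as follows. This assumes the scale-free coefficients are formed by dividing the third and fourth moments by powers of the second moment (divisor $n$ throughout), which may differ from the normalisation the original slides used; all names and the test data are illustrative.

```python
def central_moment(values, r):
    """r-th sample moment about the mean, with divisor n."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** r for v in values) / n


def moment_skewness(values):
    """Third moment divided by the second moment raised to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def moment_kurtosis(values):
    """Fourth moment divided by the square of the second moment."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


data = [1, 2, 3, 4, 5]  # hypothetical symmetric data
print(moment_skewness(data))            # 0.0 for symmetric data
print(round(moment_kurtosis(data), 2))  # 1.7, below 3, so platykurtic
```

Subtracting 3 from the kurtosis coefficient gives the excess kurtosis described above, which is negative for this flat, evenly spread data set.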