ESci 117-Module 2-Lesson 2.3
College of
ENGINEERING AND
TECHNOLOGY
Department of
AGRICULTURAL AND
BIOSYSTEMS ENGINEERING
2020
Lesson Summary
In addition to measures of location, the collected data can be further
described in terms of their spread from the center. Such measures are
referred to as measures of variability or dispersion. Also, the boxplot is
included to gauge the symmetry in the data and to identify outliers.
Learning Outcomes
At the end of the lesson, the students should be able to use the measures of
variability appropriately and interpret them correctly.
Motivation Question
Why are measures of variability necessary in providing a complete description
of our collected data?
Discussion
We will not get a complete picture of our data if we only use measures of
location to describe or summarize the observed values. We must also have
measures that can tell us how different or how variable the observed values
are from each other. This becomes even more important when we want to
know whether our collected data on a characteristic of interest can be
considered homogeneous, or when we are comparing several characteristics
of interest and want to know which is the most variable and which is the
least variable in the group.
Almeda et al. (2010) point out that a measure of dispersion (or variability)
determines the degree of dispersion or spread from the center of the
distribution (which can be represented by the measures of central tendency).
For at least ordinal-level data, a small value of such a measure will indicate that
“the observations are not too different from each other so that there is a
concentration of observations about the center” (Almeda et al., 2010). In
contrast, the same authors state that a large value will “indicate that the
observations are very different from each other so that they are widely spread
out from the center.”
For at least ordinal-level data, we can use the range (the difference
between the largest and smallest observed values) but only if the values are
close to each other. It will be a misleading measure for highly variable data.
Its usual application is in quality control, where it is used “to determine if the
production lines in a manufacturing company are in control” (Almeda et al.,
2010). Otherwise, the median absolute deviation (MAD) will be more
appropriate as all observed values are considered in the calculation. We will
adapt Ott and Longnecker’s (2016) definition of the MAD as the median of the
absolute deviations of a set of measurements $y_1, y_2, \ldots, y_n$ about their
median $\tilde{y}$:
$$\mathrm{MAD} = \operatorname{median}\{|y_1 - \tilde{y}|, |y_2 - \tilde{y}|, \ldots, |y_n - \tilde{y}|\}$$
difference between each value and their median (using the raw data): 1, 0, 2,
0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 2. Then, we find the median of these (ordered)
absolute deviations about the median: 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2, 2.
This is just the eighth value in the array so that the $\mathrm{MAD} = 0$. Thus, our
computed measure of variability indicates that the ratings given by the 15
sample household beneficiaries are not too different from each other (as the
15 ratings are concentrated about the center—the median rating of 3). This is
the opposite of what the range tells us.
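As a rough illustration of this computation, the following Python sketch computes the range and the MAD. The 15 ratings in it are hypothetical: they were chosen only so that their median is 3 and their absolute deviations match the values listed above (the signs of the deviations are an assumption). The function names are illustrative as well.

```python
from statistics import median

def value_range(values):
    """Range: largest observed value minus smallest observed value."""
    return max(values) - min(values)

def median_absolute_deviation(values):
    """Median of the absolute deviations about the median (unscaled)."""
    center = median(values)
    return median(abs(v - center) for v in values)

# Hypothetical ratings (assumed, not the original data): median 3, and
# absolute deviations 1, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 2.
ratings = [4, 3, 5, 3, 3, 3, 3, 3, 3, 2, 3, 3, 1, 3, 5]

print("median:", median(ratings))
print("range:", value_range(ratings))
print("MAD:", median_absolute_deviation(ratings))
```

With these assumed ratings the range spans the whole rating scale while the MAD is zero, which mirrors the contrast between the two measures described above.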
For at least interval-level data, we can use the measures of variability that
are based on the median and the mean of the data. We present these
measures in Table 8 for easy familiarization. Our choice of the measure of
variability to use will depend on which measure of central tendency is more
appropriate for the data. We recall from Lesson 2.2 (measures of location)
that the mean is not a good measure of central tendency if there are values
that are very much different from the rest. In this case, the median is
preferred as it is not affected by such values. On the other hand, we can also
use these variation measures to decide on the appropriate measure of central
tendency to use. For example, a large variance would indicate that the
observations are far or very different from the mean so that we cannot
consider the mean as a good measure of central tendency in this case
(Almeda et al., 2010).
For the measures based on the mean, the sample variance, $s^2$, is used to
estimate the value of the population variance, $\sigma^2$. We use $n - 1$ instead of $n$ for
For the measures based on the median, the formulas for the sample and the
population are just the same. Ott and Longnecker (2016) explain why the
median of the absolute deviations is divided by the value 0.6745 as follows:
“In a population having a normal distribution (recall the symmetric unimodal
histogram in Lesson 2.1, Figure 6) with standard deviation $\sigma$, the expected
value (or mean) of the absolute deviation about the median is $0.6745\sigma$. By
dividing the MAD by 0.6745, the expected value of MAD in a population
having a normal distribution is equal to $\sigma$.” At this point in time, we will
understand this as telling us that the MAD can also be used to estimate $\sigma$
when the population has a normal distribution since the population mean and
median are equal in this type of distribution.
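The role of the constant can be checked numerically. The sketch below is only an illustration under stated assumptions: the population is simulated as normal with a mean of 50 and a standard deviation of 10 (both values are hypothetical), and `scaled_mad` is an illustrative name. It also shows that 0.6745 agrees, to four decimal places, with the 0.75 quantile of the standard normal distribution.

```python
import random
from statistics import NormalDist, median

# The 0.75 quantile of the standard normal distribution.
constant = NormalDist().inv_cdf(0.75)

def scaled_mad(values):
    """Median of absolute deviations about the median, divided by 0.6745."""
    center = median(values)
    return median(abs(v - center) for v in values) / 0.6745

# Simulated normal data; mean 50 and sigma 10 are assumed for illustration.
rng = random.Random(2020)
sample = [rng.gauss(50.0, 10.0) for _ in range(20000)]
estimate = scaled_mad(sample)

print("constant:", round(constant, 4))
print("scaled MAD (should be close to sigma = 10):", estimate)
```

For skewed or heavy-tailed data this scaling no longer targets the standard deviation, which is why the choice of measure depends on the shape of the histogram.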
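Based on the mean:
  Variance
    Sample: $s^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$
  Standard deviation (average “distance” of each observation from the mean)
    Population: $\sigma = \sqrt{\sigma^2}$
    Sample: $s = \sqrt{s^2}$
Based on the median:
  Median absolute deviation (MAD)
    $\mathrm{MAD} = \dfrac{\operatorname{median}\{|y_1 - \tilde{y}|, |y_2 - \tilde{y}|, \ldots, |y_n - \tilde{y}|\}}{0.6745}$
    Appropriate when: the histogram represents a normal distribution.
  Average deviation (average distance of each observation from the median)
    $\dfrac{\sum_{i=1}^{n} |y_i - \tilde{y}|}{n}$
    Appropriate when: extreme values are present; the histogram is skewed.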
Example:
a. Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$, not useful as the
distribution is not normal
b. Average deviation: $\dfrac{\sum_{i=1}^{N} |y_i - \tilde{y}|}{N}$, appropriate
value to report
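The mean-based and median-based measures can be computed side by side with Python's standard `statistics` module. This is a minimal sketch: the data set is hypothetical (it is not the data of the example above), and `average_deviation` is an illustrative helper that divides by the number of observations.

```python
from statistics import median, pstdev, pvariance, stdev, variance

def average_deviation(values):
    """Average absolute distance of each observation from the median."""
    center = median(values)
    return sum(abs(v - center) for v in values) / len(values)

# Hypothetical data, assumed for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print("population variance (divisor N):", pvariance(data))
print("sample variance (divisor n - 1):", variance(data))
print("population standard deviation:", pstdev(data))
print("sample standard deviation:", stdev(data))
print("average deviation about the median:", average_deviation(data))
```

The sample variance is always slightly larger than the population variance computed from the same values because of the smaller divisor.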
The measures of variability that we have discussed so far are called absolute
measures of variability as they are used to describe the variability of a single
characteristic. To compare the variability in two or more characteristics, we
need a relative measure. For at least interval level of measurement, the
standard deviation can also be a relative measure but only when the means
are equal and the units of measurement are the same. There is a relative
measure, called the coefficient of variation, that can be used even if these
two conditions are not met since it is unitless. Almeda et al. (2010) define the
coefficient of variation as the ratio of the standard deviation to the mean,
expressed as a percentage. The formulas for the population and for the
sample are:
$$CV = \frac{\sigma}{\mu} \times 100\% \quad \text{and} \quad CV = \frac{s}{\bar{y}} \times 100\%$$
provided the mean is not zero or negative. These restrictions make sense as
Almeda et al. (2010) point out that the $CV$ expresses the standard deviation
as a percentage of the mean. A large $CV$, which indicates high variability in
the data set, results whenever the standard deviation is large compared to the
size of the mean. In contrast, a small $CV$, indicating low variability in the data
set, results whenever the standard deviation is small relative to the size of the
mean. The assumption here, however, is that the mean is a good measure of
central tendency. When comparing the variability of two or more
Note that we will not find this value useful to interpret for this characteristic
since the appropriate measure of variability is the average deviation based on
the median. However, we can use the $CV$ to compare the variability of two or
more years’ data on this characteristic.
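A minimal sketch of the coefficient of variation follows. The two data sets (heights in centimeters and weights in kilograms) are hypothetical, and the function name and its `population` flag are illustrative. The function refuses a zero or negative mean, in line with the restriction on the formula.

```python
from statistics import mean, pstdev, stdev

def coefficient_of_variation(values, population=False):
    """Standard deviation expressed as a percentage of the mean."""
    m = mean(values)
    if m <= 0:
        raise ValueError("the CV requires a positive mean")
    sd = pstdev(values) if population else stdev(values)
    return sd / m * 100.0

# Hypothetical samples measured in different units (assumed data).
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print("sample SDs:", stdev(heights_cm), stdev(weights_kg))
print("CV of heights (%):", cv_heights)
print("CV of weights (%):", cv_weights)
```

In this assumed example the two standard deviations are numerically equal, yet the weights are relatively more variable because their mean is smaller, and the CV makes that comparison possible even though the units differ.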
1. Construct a rectangle with one end at the first quartile ($Q_1$) and the other
end at the third quartile ($Q_3$). This can be drawn vertically (y-axis is the
measurement scale) or horizontally (x-axis is the measurement scale). This
rectangle indicates where the middle 50% of the data set lie.
These fences are cutoffs for outliers. Ott and Longnecker (2016) qualify
that “any data value beyond an inner fence on either side is a mild outlier,
and any data value beyond an outer fence on either side is an extreme
outlier. The smallest and largest data values that are not outliers are
called the lower adjacent value and upper adjacent value, respectively.”
4. Draw a line from each quartile to its adjacent value. These lines are
referred to as the “whiskers.”
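The quantities needed to draw a boxplot can be sketched in Python as below. Several things here are assumptions: the data are hypothetical, the inner and outer fences follow the common convention of lying 1.5 and 3 interquartile ranges beyond the quartiles, and the quartiles use the "inclusive" interpolation method of `statistics.quantiles`, which may differ slightly from the hand method used in class.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Quartiles, fences, adjacent values, and outliers for a boxplot."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # assumed convention: 1.5 x IQR
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # assumed convention: 3 x IQR
    inside = [v for v in data if inner[0] <= v <= inner[1]]
    extreme = [v for v in data if v < outer[0] or v > outer[1]]
    mild = [v for v in data if v not in inside and v not in extreme]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "inner_fences": inner, "outer_fences": outer,
        "lower_adjacent": inside[0], "upper_adjacent": inside[-1],
        "mild_outliers": mild, "extreme_outliers": extreme,
    }

# Hypothetical data, assumed for illustration only.
summary = boxplot_summary([10, 12, 13, 14, 15, 16, 17, 18, 19, 30, 60])
for name, value in summary.items():
    print(name, value)
```

The whiskers would then run from each quartile to the corresponding adjacent value, and the mild and extreme outliers would be plotted individually beyond them.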
level of variability present in this sample data on total nitrogen loads (kg
N/day) from a particular Chesapeake Bay location in the United States. The
data were collected as part of a study to determine if the water in this bay is
“fishable and swimmable” (Devore, 2012).
Figure 1. A boxplot of the nitrogen load data showing mild and extreme
outliers
We can examine the degree and direction of symmetry in the data by the
relative position of the line inside the rectangle to its sides as this shows the
respective distances of the median from the two quartiles (Almeda et al.,
2010). Specifically, if the median line is in the middle of the rectangle, the
distribution is symmetric; if the median line is closer to the lower quartile, the
distribution is positively skewed or skewed to the right (as shown in Figure 9);
if the median line is closer to the upper quartile, the distribution is negatively
skewed or skewed to the left.
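The rule above can be written as a small Python function. This is only a sketch: the function name and the optional tolerance `tol` (how close the two distances must be to call the distribution symmetric) are assumptions, and the quartile values passed in are hypothetical.

```python
def skew_direction(q1, med, q3, tol=0.0):
    """Judge symmetry from the position of the median between the quartiles."""
    lower = med - q1   # distance of the median from the lower quartile
    upper = q3 - med   # distance of the median from the upper quartile
    if abs(upper - lower) <= tol:
        return "symmetric"
    if lower < upper:
        return "positively skewed"   # median closer to the lower quartile
    return "negatively skewed"       # median closer to the upper quartile

# Hypothetical quartiles, assumed for illustration only.
print(skew_direction(10, 12, 20))
print(skew_direction(10, 18, 20))
print(skew_direction(10, 15, 20))
```

With real data the two distances are rarely exactly equal, so a small positive tolerance is usually more practical than the default of zero.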
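Example:
Lower inner fence: $Q_1 - 1.5(IQR)$
Upper inner fence: $Q_3 + 1.5(IQR)$
Lower outer fence: $Q_1 - 3(IQR)$
Upper outer fence: $Q_3 + 3(IQR)$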
The boxplot for the data is shown in Figure 10. We see a positively skewed
distribution with the median line closer to the lower quartile. The 12 monthly
high temperatures comprise a heterogeneous population with no outliers.
Figure 4. A boxplot of math and reading scores for each grade.
Learning Activity
Consider the data on crack length from Lesson 2.2, given in the stem-and-leaf
display below. Perform as indicated. Follow the guide.
SALD:
0H 89 96
1L 03 18 27 40 46
1H 61 85
2L 04 12 33 42 49
2H 53 58 71 85
3L 02 24
3H
4L
4H 50
Stem: tens digit (H = high, L = low)
Leaf: ones and tenths digits (unit = 0.1)
References
ALMEDA, J.V., T.S. CAPISTRANO, and G.M.F. SARTE. 2010. Elementary
Statistics. The University of the Philippines Press, Quezon City. pp.
231-233, 236-238, 253, 430-434.
Module Posttest
Instruction: Answer the following questions to the best of your ability.