ESci 117-Module 2-Lesson 2.3

Uploaded by

Gulferic Giomer
Copyright
© © All Rights Reserved

ESci 117

Engineering Data Analysis


Module 2. Descriptive Statistics STUDENT LEARNING GUIDE
TP-IMD-02 v0 No. CET.ESC SLG20-06

Dr. Jacqueline M. Guarte

College of
ENGINEERING AND
TECHNOLOGY

Department of
AGRICULTURAL AND
BIOSYSTEMS ENGINEERING
2020

Student Learning Guide in

ESci 117: Engineering Data Analysis
Lesson 2.3: Measures of Variability

Lesson Summary
In addition to measures of location, the collected data can be further
described in terms of their spread from the center. Such measures are
referred to as measures of variability or dispersion. Also, the boxplot is
included to gauge the symmetry in the data and to identify outliers.

Learning Outcomes
At the end of the lesson, the students should be able to use the measures of
variability appropriately and interpret them correctly.

Motivation Question
Why are measures of variability necessary in providing a complete description
of our collected data?

Discussion
We will not get a complete picture of our data if we only use measures of
location to describe or summarize the observed values. We must also have
measures that can tell us how different or how variable the observed values
are from each other. This becomes even more important when we want to
know whether our collected data on a characteristic of interest can be
considered homogeneous or not, or when we are comparing several
characteristics of interest and want to know which is the most variable and
which is the least variable in the group.

Almeda et al. (2010) point out that a measure of dispersion (or variability)
determines the degree of dispersion or spread from the center of the
distribution (which can be represented by the measures of central tendency).
For at least ordinal-level data, a small value of such a measure will indicate
that “the observations are not too different from each other so that there is a
concentration of observations about the center” (Almeda et al., 2010). In
contrast, the same authors state that a large value will “indicate that the
observations are very different from each other so that they are widely spread
out from the center.”

For nominal-level data, we can use the relative frequencies or
proportions of the categories studied to gauge whether the characteristic of
interest is homogeneous or heterogeneous based on the collected data. For
example, suppose we have a yes/no response to a question in a survey
conducted among coconut farmers in a certain village and the relative
frequencies are 0.9 for “yes” and 0.1 for “no.” This 0.9-0.1 (or 0.1-0.9) pair
tells us that the responses of the farmers are “relatively homogeneous.” The
same description can be made for the pairs 0.8-0.2 (or 0.2-0.8) and 0.7-0.3
(or 0.3-0.7), since a 1.0-0 (or 0-1.0) pair reflects “perfectly homogeneous”
responses. For the pair 0.6-0.4 (or 0.4-0.6), we can say “relatively
heterogeneous” responses, since the pair 0.5-0.5 indicates the most
heterogeneous responses.
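
The yes/no example above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the survey list is made-up data matching the 0.9-0.1 split, and the cutoff of 0.7 between "relatively homogeneous" and "relatively heterogeneous" is an assumption, since the lesson does not say where exactly the boundary between 0.6 and 0.7 lies.

```python
from collections import Counter


def relative_frequencies(responses):
    """Return the proportion of each category among the responses."""
    counts = Counter(responses)
    total = len(responses)
    return {category: count / total for category, count in counts.items()}


def describe_split(proportions, cutoff=0.7):
    """Label a two-category split using the lesson's wording.

    The cutoff of 0.7 is an assumption, not a rule stated in the lesson.
    """
    largest = max(proportions.values())
    if largest == 1.0:
        return "perfectly homogeneous"
    if largest >= cutoff:
        return "relatively homogeneous"
    if largest > 0.5:
        return "relatively heterogeneous"
    return "most heterogeneous"


# Hypothetical survey: 9 "yes" and 1 "no" out of 10 farmers.
survey = ["yes"] * 9 + ["no"]
proportions = relative_frequencies(survey)
print(proportions)
print(describe_split(proportions))
```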

For at least ordinal-level data, we can use the range (the difference
between the largest and smallest observed values) but only if the values are
close to each other. It will be a misleading measure for highly variable data.
Its usual application is in quality control, where it is used “to determine if the
production lines in a manufacturing company are in control” (Almeda et al.,
2010). Otherwise, the median absolute deviation (MAD) will be more
appropriate as all observed values are considered in the calculation. We will
adapt Ott and Longnecker’s (2016) definition of the MAD as the median of the
absolute deviations of a set of $n$ measurements about their
median $\tilde{x}$, scaled by the constant 0.6745:

$$\mathrm{MAD} = \frac{\operatorname{median}\{|x_1 - \tilde{x}|,\ |x_2 - \tilde{x}|,\ \ldots,\ |x_n - \tilde{x}|\}}{0.6745}$$

Example: (Ordinal data)

Suppose 15 randomly selected household beneficiaries of a solar home
system on a certain island were asked to rate their unit’s performance after
one year of use based on the following rating scale: 1 – poor, 2 – fair, 3 –
satisfactory, 4 – very satisfactory, and 5 – excellent. Suppose further that the
sample responses were: 4, 3, 5, 3, 3, 3, 3, 3, 3, 2, 3, 3, 1, 3, 1. The range of
this data set is simply $5 - 1 = 4$ and indicates a relatively heterogeneous set
of ratings. Now, we proceed to find the MAD.

First, we find the median of the (ordered) values 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 4, 5. This is $\tilde{x} = x_{((15+1)/2)} = x_{(8)} = 3$. Next, we determine
the absolute difference between each value and the median (using the raw
data): 1, 0, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 2. Then, we find the median of
these (ordered) absolute deviations about the median: 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 2, 2, 2. This is just the eighth value in the array, which is 0, so that
$\mathrm{MAD} = 0$. Thus, our computed measure of variability indicates that the
ratings given by the 15 sample household beneficiaries are not too different
from each other (as the 15 ratings are concentrated about the center, that is,
the median rating of 3). This is the opposite of what the range tells us.
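
The hand computation above can be checked with a short Python sketch. It uses only the standard library; the variable names are illustrative, and the division by 0.6745 follows the scaling that the lesson attributes to Ott and Longnecker (2016), which makes no difference here because the median of the absolute deviations is zero.

```python
from statistics import median

# Ratings given by the 15 sample household beneficiaries (raw order).
ratings = [4, 3, 5, 3, 3, 3, 3, 3, 3, 2, 3, 3, 1, 3, 1]

# Range: largest minus smallest observed value.
data_range = max(ratings) - min(ratings)

# Median rating, then the absolute deviation of each rating from it.
center = median(ratings)
absolute_deviations = [abs(rating - center) for rating in ratings]

# MAD: median of the absolute deviations, divided by 0.6745.
mad = median(absolute_deviations) / 0.6745

print("range:", data_range)
print("median:", center)
print("sorted absolute deviations:", sorted(absolute_deviations))
print("MAD:", mad)
```

The range and the MAD disagree here because the range looks only at the two most extreme ratings, while the MAD reflects where most of the ratings sit.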

For at least interval-level data, we can use the measures of variability that
are based on the median and the mean of the data. We present these
measures in Table 8 for easy familiarization. Our choice of the measure of
variability to use will depend on which measure of central tendency is more
appropriate for the data. We recall from Lesson 2.2 (measures of location)
that the mean is not a good measure of central tendency if there are values
that are very much different from the rest. In this case, the median is
preferred as it is not affected by such values. On the other hand, we can also
use these variation measures to decide on the appropriate measure of central
tendency to use. For example, a large variance would indicate that the
observations are far or very different from the mean so that we cannot
consider the mean as a good measure of central tendency in this case
(Almeda et al., 2010).
For the measures based on the mean, the sample variance, $s^2$, is used to
estimate the value of the population variance, $\sigma^2$. We use $n - 1$
instead of $n$ for the divisor to improve the estimation of $\sigma^2$
(Almeda et al., 2010). Otherwise, we will tend to underestimate the
population variance.
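
The effect of the divisor can be illustrated with a small exact calculation. The sketch below assumes a hypothetical population of four values and lists every possible sample of size 2 drawn with replacement; it then averages the two versions of the sample variance over all samples. Exact fractions are used so that no rounding is involved. The population values and names are illustrative only.

```python
from fractions import Fraction
from itertools import product


def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    center = Fraction(sum(values), len(values))
    return sum((Fraction(v) - center) ** 2 for v in values) / divisor


# Hypothetical population and sample size (assumptions for illustration).
population = [2, 4, 6, 8]
n = 2

# Population variance uses the population size N as the divisor.
sigma_squared = variance(population, len(population))

# Every possible sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

average_with_n = sum(variance(s, n) for s in samples) / len(samples)
average_with_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print("population variance:", sigma_squared)
print("average sample variance, divisor n:", average_with_n)
print("average sample variance, divisor n - 1:", average_with_n_minus_1)
```

In this setup the divisor $n$ gives values that are too small on average, while the divisor $n - 1$ averages out to the population variance.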

For the measures based on the median, the formulas for the sample and the
population are just the same. Ott and Longnecker (2016) explain why the
median of the absolute deviations is divided by the value 0.6745 as follows:
“In a population having a normal distribution (recall the symmetric unimodal
histogram in Lesson 2.1, Figure 6) with standard deviation $\sigma$, the expected
value (or mean) of the absolute deviation about the median is $0.6745\sigma$. By
dividing the MAD by 0.6745, the expected value of MAD in a population
having a normal distribution is equal to $\sigma$.” At this point in time, we will
understand this as telling us that the MAD can also be used to estimate $\sigma$
when the population has a normal distribution since the population mean and
median are equal in this type of distribution.
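
One way to see where the constant 0.6745 comes from is the standard result that half of a normal population lies within about $0.6745\sigma$ of its center, so 0.6745 is the 75th percentile of the standard normal distribution. The sketch below checks this with Python's standard library; the mean of 10 and standard deviation of 2 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# 75th percentile of the standard normal distribution.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
constant = standard_normal.inv_cdf(0.75)
print("constant:", round(constant, 4))

# Hypothetical normal population (values assumed for illustration).
mu, sigma = 10.0, 2.0
population = NormalDist(mu=mu, sigma=sigma)

# Share of the population within constant * sigma of the center.
half_width = constant * sigma
share_within = population.cdf(mu + half_width) - population.cdf(mu - half_width)
print("share within 0.6745 sigma of the center:", round(share_within, 4))
```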

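Table 8. Summary of the different measures of variability for at least
interval-level data

Columns: Measure; Definition/Formula; Data Requirement

Based on the mean (data requirement: values are close to each other)

  Variance
    Population: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, average squared
    difference of each observation from the mean
    Sample: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

  Standard Deviation
    Population: $\sigma = \sqrt{\sigma^2}$, average “distance” of each
    observation from the mean
    Sample: $s = \sqrt{s^2}$

Based on the median

  Median Absolute Deviation (MAD)
    $\mathrm{MAD} = \dfrac{\operatorname{median}\{|x_1 - \tilde{x}|,\ |x_2 - \tilde{x}|,\ \ldots,\ |x_n - \tilde{x}|\}}{0.6745}$
    Data requirement: histogram represents a normal distribution.

  Average Deviation
    $\mathrm{AD} = \dfrac{\sum_{i=1}^{n} |x_i - \tilde{x}|}{n}$, average distance
    of each observation from the median
    Data requirement: presence of extreme values; histogram is skewed.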

Example:

Consider again the population data on the average monthly high
temperatures (°C) for 2019 in Manila, Philippines: 29.5, 30.2, 31.9, 33.3, 33.4,
32.1, 31.2, 30.4, 30.6, 30.9, 30.5, 29.6. The population mean and median can
be computed to be $\mu = 31.13$ and $\tilde{x} = 30.75$, respectively. Since the
stem-and-leaf display indicates a positively skewed histogram, the median is the
appropriate measure of central tendency and we can choose the average
deviation (based on the median) to describe the variability of the
temperatures. The MAD is only useful for symmetric unimodal or normal
distributions. For illustration and comparison purposes, we will compute all
four measures of variability.

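1. Measures based on the mean:

a. Population variance:

$\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N} = \dfrac{(29.5 - 31.13)^2 + (30.2 - 31.13)^2 + \cdots + (29.6 - 31.13)^2}{12} \approx 1.54$

b. Population standard deviation:

$\sigma = \sqrt{\sigma^2} = \sqrt{1.54} \approx 1.24$ °C, not representative of the
average difference from the center of the data

2. Measures based on the median:

a. Median absolute deviation:

$\mathrm{MAD} = \dfrac{\operatorname{median}\{|x_1 - \tilde{x}|,\ |x_2 - \tilde{x}|,\ \ldots,\ |x_{12} - \tilde{x}|\}}{0.6745}$

$= \dfrac{\operatorname{median}\{|29.5 - 30.75|,\ |30.2 - 30.75|,\ \ldots,\ |29.6 - 30.75|\}}{0.6745}$

$= \dfrac{\operatorname{median}\{1.25,\ 0.55,\ 1.15,\ 2.55,\ 2.65,\ 1.35,\ 0.45,\ 0.35,\ 0.15,\ 0.15,\ 0.25,\ 1.15\}}{0.6745}$

$= \dfrac{(0.55 + 1.15)/2}{0.6745} = \dfrac{0.85}{0.6745} \approx 1.26$ °C, not useful as the
distribution is not normal

b. Average deviation:

$\mathrm{AD} = \dfrac{\sum |x_i - \tilde{x}|}{n} = \dfrac{12.00}{12} = 1.00$ °C, appropriate
value to report

Interpretation: On the average, the 12 average monthly high
temperatures in 2019 were 1.00 °C away from their median.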
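
The four measures for the Manila temperatures can be recomputed with a short Python sketch using the standard library. The variable names are illustrative; the formulas follow the definitions summarized in the lesson's table, with the population divisor for the variance.

```python
from math import sqrt
from statistics import mean, median, pvariance

# Average monthly high temperatures (°C) for 2019 in Manila, Philippines.
temps = [29.5, 30.2, 31.9, 33.3, 33.4, 32.1,
         31.2, 30.4, 30.6, 30.9, 30.5, 29.6]

mu = mean(temps)
med = median(temps)

# Measures based on the mean (population formulas, divisor N).
variance = pvariance(temps)
std_dev = sqrt(variance)

# Measures based on the median.
absolute_deviations = [abs(t - med) for t in temps]
mad = median(absolute_deviations) / 0.6745
average_deviation = sum(absolute_deviations) / len(temps)

print(f"mean = {mu:.2f}, median = {med:.2f}")
print(f"population variance = {variance:.2f}")
print(f"population standard deviation = {std_dev:.2f}")
print(f"MAD = {mad:.2f}")
print(f"average deviation = {average_deviation:.2f}")
```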

The measures of variability that we have discussed so far are called absolute
measures of variability as they are used to describe the variability of a single
characteristic. To compare the variability in two or more characteristics, we
need a relative measure. For at least interval level of measurement, the
standard deviation can also be a relative measure but only when the means
are equal and the units of measurement are the same. There is a relative
measure, called the coefficient of variation, that can be used even if these
two conditions are not met since it is unitless. Almeda et al. (2010) define the
coefficient of variation as the ratio of the standard deviation to the mean,
expressed as a percentage. The formulas for the population and for the
sample are:

$CV = \dfrac{\sigma}{\mu} \times 100\%$ and $CV = \dfrac{s}{\bar{x}} \times 100\%$

provided the mean is not zero or negative. These restrictions make sense as
Almeda et al. (2010) point out that the CV expresses the standard deviation
as a percentage of the mean. A large CV, which indicates high variability in
the data set, results whenever the standard deviation is large compared to the
size of the mean. In contrast, a small CV, indicating low variability in the data
set, results whenever the standard deviation is small relative to the size of the
mean. The assumption here, however, is that the mean is a good measure of
central tendency. When comparing the variability of two or more
characteristics using the CV, this assumption is ignored. For the population
data on the average monthly high temperatures in Manila, Philippines, the
CV is determined as follows:

$CV = \dfrac{\sigma}{\mu} \times 100\% = \dfrac{1.2425}{31.13} \times 100\% \approx 3.99\%$

Note that we will not find this value useful to interpret for this characteristic
since the appropriate measure of variability is the average deviation based on
the median. However, we can use the CV to compare the variability of two or
more years’ data on this characteristic.
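
A minimal Python sketch of the coefficient of variation is given below. The function name and its `population` flag are hypothetical; the check on the mean mirrors the restriction that the mean must not be zero or negative.

```python
from statistics import mean, pstdev, stdev


def coefficient_of_variation(values, population=True):
    """Standard deviation as a percentage of the mean.

    Uses the population standard deviation (divisor N) by default and the
    sample standard deviation (divisor n - 1) when population is False.
    """
    center = mean(values)
    if center <= 0:
        raise ValueError("the CV requires a positive mean")
    spread = pstdev(values) if population else stdev(values)
    return spread / center * 100


# Average monthly high temperatures (°C) for 2019 in Manila, Philippines.
temps_2019 = [29.5, 30.2, 31.9, 33.3, 33.4, 32.1,
              31.2, 30.4, 30.6, 30.9, 30.5, 29.6]

cv_2019 = coefficient_of_variation(temps_2019)
print(f"CV = {cv_2019:.2f}%")
```

To compare years, the same function would be applied to each year's data and the resulting percentages compared directly, since the CV is unitless.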

Describing the variability of a characteristic of interest measured on at least
an interval scale can include identifying the presence of “unusual values” or
outliers as this will lend support to the extent that values can be different
from each other. Also, this will give insights on the possible range of values
the characteristic of interest can take in the population under study. For these
purposes, we will use the box-and-whisker plot, or simply, the boxplot, an
important tool in exploratory data analysis. We present the method, based on
Almeda et al. (2010) and Ott and Longnecker (2016), which makes use of the
quartiles. The basic steps can be stated as follows:

1. Construct a rectangle with one end at the first quartile ($Q_1$) and the other
end at the third quartile ($Q_3$). This can be drawn vertically (y-axis is the
measurement scale) or horizontally (x-axis is the measurement scale). This
rectangle indicates where the middle 50% of the data set lies.

2. Put a line across the interior of the rectangle at the median.

3. Let $IQR = Q_3 - Q_1$ be the interquartile range. This is also a measure of
dispersion. Based on this, we compute the following “fences”:

Lower inner fence: $Q_1 - 1.5(IQR)$
Upper inner fence: $Q_3 + 1.5(IQR)$
Lower outer fence: $Q_1 - 3(IQR)$
Upper outer fence: $Q_3 + 3(IQR)$

These fences are cutoffs for outliers. Ott and Longnecker (2016) qualify
that “any data value beyond an inner fence on either side is a mild outlier,
and any data value beyond an outer fence on either side is an extreme
outlier. The smallest and largest data values that are not outliers are
called the lower adjacent value and upper adjacent value, respectively.”

4. Draw a line from each quartile to its adjacent value. These lines are
referred to as the “whiskers.”

5. Mark each mild outlier with a closed circle, ●.

6. Mark each extreme outlier with an open circle, O.
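
The fence rule in the steps above can be sketched as a pair of small Python functions. The function names are hypothetical, and the quartiles of 10 and 20 in the usage lines are made-up numbers chosen only to make the cutoffs easy to follow.

```python
def fences(q1, q3):
    """Return the inner and outer fences computed from the two quartiles."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3 * iqr,
    }


def classify(value, q1, q3):
    """Classify a value as an extreme outlier, a mild outlier, or neither."""
    f = fences(q1, q3)
    if value < f["lower_outer"] or value > f["upper_outer"]:
        return "extreme outlier"
    if value < f["lower_inner"] or value > f["upper_inner"]:
        return "mild outlier"
    return "not an outlier"


# Hypothetical quartiles: Q1 = 10 and Q3 = 20, so IQR = 10.
print(fences(10, 20))
for value in (12, 36, 51):
    print(value, "->", classify(value, 10, 20))
```

A value lying exactly on a fence is treated here as not beyond it, which is one reading of "beyond"; other conventions are possible.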


A boxplot with outliers is shown in Figure 9. We note that the four largest
observations are extreme outliers while the next four largest are mild
outliers. Since there are no negative observations, there are no outliers on
the lower end or left part of the data. This boxplot clearly shows the high
level of variability present in this sample data on total nitrogen loads (kg
N/day) from a particular Chesapeake Bay location in the United States. The
data were collected as part of a study to determine if the water in this bay is
“fishable and swimmable” (Devore, 2012).

Figure 9. A boxplot of the nitrogen load data showing mild and extreme
outliers

Source: Taken from J. Devore’s Probability and Statistics for
Engineering and the Sciences, 8th edn., Brooks/Cole, Cengage
Learning, Boston, MA, USA, 2012, p. 41.

We can examine the degree and direction of symmetry in the data by the
relative position of the line inside the rectangle to its sides as this shows the
respective distances of the median from the two quartiles (Almeda et al.,
2010). Specifically, if the median line is in the middle of the rectangle, the
distribution is symmetric; if the median line is closer to the lower quartile, the
distribution is positively skewed or skewed to the right (as shown in Figure 9);
if the median line is closer to the upper quartile, the distribution is negatively
skewed or skewed to the left.

Additional information about skewness can be obtained from the lengths of
the whiskers: the longer one whisker is relative to the other, the more
skewness there is in the tail with the longer whisker (Ott and Longnecker,
2016). This is illustrated in Figure 9. However, this whisker-based
assessment may not always agree with that based on the median line. In
such situations, we follow that based on the median line. With a lot of
information supplied by a boxplot, we just need to remember that a skewed
distribution is categorically heterogeneous.

Example:

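Consider again the population data on the average monthly high
temperatures (°C) for 2019 in Manila, Philippines: 29.5, 30.2, 31.9, 33.3, 33.4,
32.1, 31.2, 30.4, 30.6, 30.9, 30.5, 29.6. The median is $\tilde{x} = 30.75$, with
first quartile $Q_1$ and third quartile $Q_3$. Our interquartile range is then
$IQR = Q_3 - Q_1$. Before constructing the boxplot, we will first find the four
fences to determine any outlier in the data:

Lower inner fence: $Q_1 - 1.5(IQR)$
Upper inner fence: $Q_3 + 1.5(IQR)$
Lower outer fence: $Q_1 - 3(IQR)$
Upper outer fence: $Q_3 + 3(IQR)$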

The boxplot for the data is shown in Figure 10. We see a positively skewed
distribution with the median line closer to the lower quartile. The 12 monthly
high temperatures comprise a heterogeneous population with no outliers.
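
The summary values behind this boxplot can be recomputed with the Python sketch below. One assumption matters here: quartiles can be computed under several conventions, and this sketch takes $Q_1$ and $Q_3$ as the medians of the lower and upper halves of the ordered data. A different convention may give slightly different quartiles and fences than those used in the lesson, although the conclusions about skewness and outliers should not change for this data set.

```python
from statistics import median

# Average monthly high temperatures (°C) for 2019 in Manila, Philippines.
temps = [29.5, 30.2, 31.9, 33.3, 33.4, 32.1,
         31.2, 30.4, 30.6, 30.9, 30.5, 29.6]

ordered = sorted(temps)
n = len(ordered)

# Assumed quartile rule: medians of the lower and upper halves.
lower_half = ordered[: n // 2]
upper_half = ordered[n // 2 + n % 2:]
q1, q2, q3 = median(lower_half), median(ordered), median(upper_half)
iqr = q3 - q1

# Inner and outer fences.
lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr

# Outliers and adjacent values.
extreme = [x for x in ordered if x < lower_outer or x > upper_outer]
mild = [x for x in ordered
        if (lower_outer <= x < lower_inner) or (upper_inner < x <= upper_outer)]
lower_adjacent = min(x for x in ordered if x >= lower_inner)
upper_adjacent = max(x for x in ordered if x <= upper_inner)

# Direction of skewness from the position of the median line in the box.
gap_below, gap_above = q2 - q1, q3 - q2
if abs(gap_below - gap_above) < 1e-9:
    skew = "symmetric"
elif gap_below < gap_above:
    skew = "positively skewed"
else:
    skew = "negatively skewed"

print(f"Q1 = {q1:.2f}, median = {q2:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"inner fences: {lower_inner:.2f}, {upper_inner:.2f}")
print(f"outer fences: {lower_outer:.2f}, {upper_outer:.2f}")
print("mild outliers:", mild, "extreme outliers:", extreme)
print("adjacent values:", lower_adjacent, upper_adjacent)
print("shape:", skew)
```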

Figure 10. A boxplot of the average monthly high temperature in
Manila, Philippines: 2019.

Axis label: Average monthly high temperature (°C)

Devore (2012) shares that “a comparative or side-by-side boxplot is a very
effective way of revealing similarities and differences between two or more
data sets consisting of observations on the same characteristic or variable—
fuel efficiency for four different types of automobiles, crop yields for three
different varieties, and so on.” This is best done with the vertical axis as the
measurement scale as this will allow us to construct two or more boxplots,
one after the other, for each of the data sets to be compared. To better
comprehend and appreciate this possibility, two examples are presented as
shown in Figures 11 and 12 (Ott and Longnecker, 2016).

Figure 11. A boxplot of impurities removed using three filter types.

Source: Taken from R.L. Ott and M. Longnecker’s An Introduction to
Statistical Methods and Data Analysis, 7th edn., Cengage
Learning, Boston, MA, USA, 2016, p. 109.

Figure 12. A boxplot of math and reading scores for each grade.

Source: Taken from R.L. Ott and M. Longnecker’s An Introduction to
Statistical Methods and Data Analysis, 7th edn., Cengage
Learning, Boston, MA, USA, 2016, p. 120.

Learning Activity
Consider the data on crack length, given its stem-and-leaf display below, from
Lesson 2.2. Perform as indicated. Follow the guide.

1. Compute the appropriate measure of variability, justify your choice, and
interpret.
2. Compute a relative measure of variability.
3. Construct a boxplot and further characterize the differences among the
data values.

SALD:

0H | 89 96
1L | 03 18 27 40 46
1H | 61 85
2L | 04 12 33 42 49
2H | 53 58 71 85
3L | 02 24
3H |
4L |
4H | 50

Stem: tens digit (H = high, L = low)
Leaf: ones and tenths digits, or Unit = 0.1

1. Appropriate measure of variability:
Justification:
Interpretation:

2. Relative measure of variability:

3. Summary measures needed for the boxplot:

The boxplot for this data set is:

References
ALMEDA, J.V., T.S. CAPISTRANO, and G.M.F. SARTE. 2010. Elementary
Statistics. The University of the Philippines Press, Quezon City. pp.
231-233, 236-238, 253, 430-434.

DEVORE, J. 2012. Probability and Statistics for Engineering and the
Sciences, 8th edn. Brooks/Cole, Cengage Learning, Boston, MA, USA.
p. 41.

OTT, R.L. and M. LONGNECKER. 2016. An Introduction to Statistical
Methods and Data Analysis, 7th edn. Cengage Learning, Boston, MA,
USA. pp. 98-100, 106-109, 120.

Module Posttest
Instruction: Answer the following questions to the best of your ability.

1. Why does a stem-and-leaf display resemble a frequency distribution?

2. How does a relative frequency histogram summarize data?

3. What are quantiles?

4. How does a boxplot show outliers?


DEPARTMENT OF
AGRICULTURAL AND BIOSYSTEMS ENGINEERING
College of Engineering and Technology

For inquiries, contact:

ENGR. ELDON P. DE PADUA


[email protected][email protected]
+63 53 565 0600 Local 1015

Use this code when referring to this material:


TP-IMD-02 v0 07-15-20 • No. CET.ABE SLG20-06

Visca, Baybay City, Leyte


Philippines 6521
[email protected]
+63 53 565 0600
