Describing Data
This worksheet focuses on describing data through measuring its centre and variability.
These measurements will give us an idea of what our data set looks like.
CENTRE
There are three ways to describe the centre of a data set: mean, median, and mode (the value or values that occur most often).
Mean is the technical term for what most people call an average. In statistics, “average”
is too vague, so you aren’t likely to see it. The mean represents the typical value of a
data set.
To find the mean, we first take the sum of all numbers in the data set. The symbol for
taking the sum of a set of numbers is the capital Greek letter sigma, Σ, so “Σ x” means
“take the sum of all values of x”. Then we divide by n, the number of observations in the
data set. You will see the mean written two ways: x̄ (pronounced “x
bar”) is the mean of a sample of data, and μ (pronounced
“mew”, the Greek letter mu) is the mean of a population of data. The
population is all the members of the group of interest (e.g., all ducks) while a sample is
a smaller group, or subset, of the population (e.g., 250 ducks at Trout Lake).
Example 1: Find the mean of the following sample: {3, 5, 4, 9, 8, 5, 7, 8, 9, 12}
Solution: We take the sum of all numbers, and then we divide by n, which is 10:
$$\sum x_i = 3 + 5 + 4 + 9 + 8 + 5 + 7 + 8 + 9 + 12 = 70$$
$$\bar{x} = \frac{\sum x_i}{n} = \frac{70}{10} = 7$$
The mean of our data set is 7. The formula above can be used to calculate the mean of
any data set.
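As a quick check of Example 1, the same calculation can be sketched in Python. The variable names (`sample`, `total`, `mean`) are chosen for illustration only and are not part of the worksheet.

```python
# Sample from Example 1
sample = [3, 5, 4, 9, 8, 5, 7, 8, 9, 12]

total = sum(sample)   # sum of all values (the "Σ x" step)
n = len(sample)       # number of observations
mean = total / n      # sample mean: sum divided by n

print(total, n, mean)  # 70 10 7.0
```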
The median is the middle value of an ordered data set. If there is an odd number of
observations, then the median is the middle number. If there is an even number of
observations, we take the mean of the two middle numbers. We find the position of the
central observation using the formula:
$$\text{position number} = \frac{n + 1}{2}$$
The median is a more useful measure of central tendency than the mean if the data is skewed,
meaning that the data favours high numbers over low numbers, or vice versa. In
graphical form, a skewed curve appears asymmetrical, with one longer tail leading off in
the direction of skew. (So data that is right skewed has a long tail to the right.)
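The position rule for the median can be sketched in Python as follows. This is an illustration only: the function name `median` and the choice to sort inside the function are assumptions, and the sketch expects a non-empty list of numbers.

```python
def median(values):
    """Median via the position formula (n + 1) / 2, using 1-based positions."""
    ordered = sorted(values)      # the median is defined on ordered data
    n = len(ordered)
    position = (n + 1) / 2        # e.g. 5.5 when n = 10, 6.0 when n = 11
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that observation
        return ordered[int(position) - 1]
    # Even n: the position ends in .5, so average the two neighbouring values
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


print(median([3, 5, 4, 9, 8, 5, 7, 8, 9, 12]))  # 7.5
```

For the ten values from Example 1, the position is 5.5, so the result is the mean of the 5th and 6th ordered values (7 and 8).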
VARIABILITY
The range is the difference between the highest (maximum) and lowest (minimum)
value in the data set, but it doesn’t tell us much. If a data set has 50 observations, of
which one is 20, another is 100, and the rest are all 60, the range is 100 − 20 = 80, but
that’s not a true measure of how much the data varies.
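The 50-observation data set described above can be built in Python to see how the range reacts to just two extreme values. The list construction is an illustrative assumption; the worksheet only describes the data in words.

```python
# One observation of 20, one of 100, and the remaining 48 all equal to 60
data = [20, 100] + [60] * 48

data_range = max(data) - min(data)   # highest value minus lowest value
unchanged = data.count(60)           # how many observations do not vary at all

print(len(data), data_range, unchanged)  # 50 80 48
```

The range is 80 even though 48 of the 50 observations are identical.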
The most common measure of variability of a data set is standard deviation. It reflects
the deviations, or differences, of all values in the data set from the mean. A larger
standard deviation would indicate greater variability for a data set.
If you calculated the mean mark on a class midterm to be 65, that only tells you the
average mark. Did the marks in the class look like {66, 64, 67, 66, 62, 70, …} or like {48,
97, 83, 57, 62, 81, …}? The first set of marks has low standard deviation—most of the
marks are quite close to the mean. The second set has a higher standard deviation as
there is a greater spread of values from the mean. The notation for standard deviation
of a population is σ (sigma again, but this is the lower case form of the Greek letter).
The notation for standard deviation of a sample is s. To calculate standard deviation we
use the following formulas:
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$$
Sample standard deviation:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum (x^2) - \frac{(\sum x)^2}{n}}{n - 1}}$$
The rightmost formula for sample standard deviation is the easiest one to use for
calculating s by hand. Variance is another related measure of variability that is simply
the square of the standard deviation (σ² or s²). Variance is less useful as a measure of
variability, since it is expressed in squared units and its scale often doesn’t match the
spread of the data. (Data that varies by no more than 10 might have a variance of 20. If so,
the 20 does not correspond to any distance actually seen in the data.)
Example 4: Calculate the standard deviation of the data set from Example 1.
Solution: We know n = 10 from Example 1. We also know Σ x = 70 from Example 1.
The only term left to figure out is Σ(x²), the sum of the squares of all data values:
$$\sum (x^2) = 3^2 + 5^2 + 4^2 + 9^2 + 8^2 + 5^2 + 7^2 + 8^2 + 9^2 + 12^2 = 558$$
Now we plug into the formula:
$$s = \sqrt{\frac{558 - \frac{70^2}{10}}{10 - 1}} = \sqrt{\frac{68}{9}} = \sqrt{7.555555\ldots} \approx 2.75$$
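Example 4 can be checked with a short Python sketch that evaluates the sample standard deviation two ways: from the squared deviations about the mean, and from the hand-calculation form that uses the sum of squares. The variable names are illustrative assumptions. The call to `statistics.stdev` from the standard library, which also divides by n − 1, is included only as a cross-check.

```python
import math
import statistics

sample = [3, 5, 4, 9, 8, 5, 7, 8, 9, 12]
n = len(sample)
mean = sum(sample) / n

# Form 1: squared deviations from the mean, divided by n - 1
s_definition = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

# Form 2: the hand-calculation form using the sum of x and the sum of x squared
sum_x = sum(sample)                          # 70
sum_x_squared = sum(x ** 2 for x in sample)  # 558
s_shortcut = math.sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))

variance = s_shortcut ** 2                   # sample variance, about 7.56

print(round(s_definition, 2), round(s_shortcut, 2))  # 2.75 2.75
print(math.isclose(s_shortcut, statistics.stdev(sample)))  # True
```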
Another way to measure variability is by using quartiles. This is a more accurate
description of variability than the standard deviation if a data set has strong outliers
(values that lie far away from the rest of the data) or is strongly skewed.
The first quartile (Q1) is the data point that lies above ¼ (25%) of all the points of the
data set, and the third quartile (Q3) is the point that lies above ¾ (75%) of all the data
points. The second quartile, which lies above ½ of all data points, is simply the median.
To calculate a first quartile, find the median of the “first half” of the data, and to calculate
a third quartile, find the median of the “last half” of the data. If the data set has an odd
number of observations, then the median is not included in either half of the data for the
purposes of finding quartiles.
The interquartile range (IQR) is the difference between the third quartile and first
quartile: Q3 − Q1. This range will include the middle 50% of the values of the data set.
Example 5: For the data set in Example 1, find the first and third quartiles, and the IQR.
Solution: We know the median is at position 5.5, so the first half of the data are in
positions #1 – #5, and the last half are in positions #6 – #10. The first quartile is the
median of the first half, so its position is (5 + 1)/2 = #3. The third quartile is the
median of the last half, so its position is (6 + 10)/2 = #8. Within the ordered data set:
{3, 4, 5, 5, 7, 8, 8, 9, 9, 12}
Q1 is the value in position #3 and Q3 is the value in position #8.
Therefore Q1 = 5, and Q3 = 9. The IQR is 9 − 5 = 4.
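The "median of each half" method used in Example 5 can be sketched in Python as below. The function names `median_of_sorted` and `quartiles` are hypothetical helpers written for this illustration. Library routines for quartiles may use other conventions and can return slightly different values, so this sketch follows the worksheet's rule rather than any particular library.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                  # the "first half" of the data
    # With an odd number of observations, the median is left out of both halves
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


q1, q2, q3 = quartiles([3, 5, 4, 9, 8, 5, 7, 8, 9, 12])
print(q1, q3, q3 - q1)  # 5 9 4
```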
A good summary of a data set is called a five-figure summary. It includes the minimum
value of the set, the first quartile, the median, the third quartile and the maximum value.
The five-figure summary gives a good idea of the spread of data as well as the
presence of any outliers in the data set. In the following example, there’s a serious
outlier at one end of the data, but the description of the centre won’t be affected by it:
EXERCISES
A. For the following sets of data, calculate (a) sample mean, (b) median, (c) mode, (d)
range, (e) standard deviation, (f) first quartile, (g) third quartile, (h) interquartile range,
and (i) the five-figure summary. Round to two decimal places where appropriate.
1) {8, 24, 9, 6, 10, 18, 7, 14, 16, 21, 13, 24}
2) {3, 6, 5, 4, 6, 5, 9, 10, 11, 7, 9}
3) {41, 39, 38, 42, 43, 39, 40, 43, 26, 42, 42, 41, 41, 42, 27, 55, 60}
B. Identify the data set (1, 2, or 3 according to the numbered exercises above) with the
greatest variability based on (a) standard deviation, (b) range and (c) IQR.
C. Explain why the answers to B(a) and B(b) are different from B(c).
SOLUTIONS
A. 1) (a) 14.17 (b) 13.5 (position #6.5) (c) 24 (d) 18 (e) 6.46 (f) 8.5 (position #3.5)
(g) 19.5 (position #9.5) (h) 11 (i) 6, 8.5, 13.5, 19.5, 24
2) (a) 6.82 (b) 6 (position #6) (c) 5, 6 and 9 (d) 8 (e) 2.60 (f) 5 (position #3)
(g) 9 (position #9) (h) 4 (i) 3, 5, 6, 9, 11
3) (a) 41.24 (b) 41 (position #9) (c) 42 (d) 34 (e) 7.93 (f) 39 (position #4.5)
(g) 42.5 (position #13.5) (h) 3.5 (i) 26, 39, 41, 42.5, 60
B. a) Data set #3 has the greatest standard deviation.
b) Data set #3 has the greatest range.
c) Data set #1 has the greatest IQR.
C. Data set #3 has outliers at both ends of the data. Because of this, data set #3 has a
high standard deviation and range. However, the IQR is less sensitive to outliers.
For this reason, the IQR of data set #3 is much lower than that of data set #1.
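The mode answers in part A(c) can be checked with a short Python sketch. The helper name `modes` is an assumption made for this illustration. It returns every value tied for the highest count, which is why data set 2 reports three modes.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


exercise_sets = {
    1: [8, 24, 9, 6, 10, 18, 7, 14, 16, 21, 13, 24],
    2: [3, 6, 5, 4, 6, 5, 9, 10, 11, 7, 9],
    3: [41, 39, 38, 42, 43, 39, 40, 43, 26, 42, 42, 41, 41, 42, 27, 55, 60],
}

for number, data in exercise_sets.items():
    print(number, modes(data))
# 1 [24]
# 2 [5, 6, 9]
# 3 [42]
```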