2a. Describing Variables with Numbers
• Central Tendency Measures
  Mean
  Median
  Mode
• Variability Measures
  Range
  Quartiles
  Interquartile Range
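Summation Notation
\[ \sum_{i=1}^{n} x_i \]
In this expression, \sum is the summation sign, x_i is the variable, i is the index of summation, 1 is the lower limit, and n is the upper limit.
\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n \]
\[ \sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 \]
\[ \sum_{i=3}^{5} x_i = x_3 + x_4 + x_5 \]
\[ \sum x = x_1 + x_2 + x_3 + \dots + x_n \]
Sum the squared values of x
\[ \sum_{i=1}^{n} x_i^2 = x_1^2 + x_2^2 + x_3^2 + \dots + x_n^2 \]
Sum the products of x and y
\[ \sum_{i=1}^{n} x_i y_i = x_1 y_1 + x_2 y_2 + \dots + x_n y_n \]
Example
i    x_i   y_i
1    10    0
2    8     3
3    6     6
4    4     9
5    2     12
\[ \sum_{i=1}^{5} x_i = 10 + 8 + 6 + 4 + 2 = 30 \]
\[ \sum_{i=1}^{5} y_i = 0 + 3 + 6 + 9 + 12 = 30 \]
\[ \sum_{i=1}^{5} x_i^2 = 10^2 + 8^2 + 6^2 + 4^2 + 2^2 = 220 \]
\[ \left( \sum_{i=1}^{5} y_i \right)^2 = (0 + 3 + 6 + 9 + 12)^2 = (30)^2 = 900 \]
\[ \sum_{i=1}^{5} x_i y_i = (10 \cdot 0) + (8 \cdot 3) + (6 \cdot 6) + (4 \cdot 9) + (2 \cdot 12) = 120 \]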
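The sums in the worked example can be checked with a few lines of code. The following is a minimal sketch in Python using the x and y values from the example table; the variable names are illustrative and not part of the original slides.

```python
# Values of x and y from the worked example (i = 1, ..., 5).
x = [10, 8, 6, 4, 2]
y = [0, 3, 6, 9, 12]

sum_x = sum(x)                                  # sum of the x values
sum_y = sum(y)                                  # sum of the y values
sum_x_squared = sum(xi ** 2 for xi in x)        # sum of the squared x values
square_of_sum_y = sum(y) ** 2                   # square of the sum of the y values
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # sum of the products x_i * y_i

# Partial sum from i = 3 to i = 5. Python lists start at index 0,
# so the third value is x[2].
partial_sum_x = sum(x[2:5])

print(sum_x, sum_y, sum_x_squared, square_of_sum_y, sum_xy, partial_sum_x)
```

Note the difference between summing the squared values and squaring the sum: the two operations generally give different results.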
Central Tendency Measures
• Researchers may use several measures to describe the center of a distribution.
• These statistics differ primarily in how much information they use when describing the data.
• The mode uses the least information, while the mean uses the most.
Mode
• The mode is the simplest of the central tendency measures.
• The mode of a variable is the value that appears most frequently in the data.
8, 9, 8, 7, 10, 9, 6, 4, 9, 8, 7, 8, 10, 9, 8, 6, 9, 7, 8, 8
It is easier to find the mode after sorting the data.
4, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10
The value 8 appears seven times, more often than any other value, so the mode is 8.
Sometimes two values are observed for the same number of subjects. Consider the following example.
4, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 10, 10
Here 8 and 9 each appear seven times. In this case, one can argue that the data are bimodal and report both modes.
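Counting frequencies by hand becomes tedious for larger datasets. The sketch below uses Python's standard library to count each value and to list every mode; `multimode` returns all values that tie for the highest frequency. The variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

# The two datasets from the mode examples.
scores = [8, 9, 8, 7, 10, 9, 6, 4, 9, 8, 7, 8, 10, 9, 8, 6, 9, 7, 8, 8]
bimodal = [4, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8,
           9, 9, 9, 9, 9, 9, 9, 10, 10]

# Frequency of each distinct value in the first dataset.
counts = Counter(scores)

# multimode returns a list containing every most-frequent value.
modes_scores = multimode(scores)
modes_bimodal = multimode(bimodal)

print(counts[8], modes_scores, modes_bimodal)
```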
Median
• The median is the midpoint of the data when they are sorted in increasing or decreasing order.
• The midpoint is the value at or below which 50% of the observations fall.
• To find the position of the midpoint, add 1 to the number of observations and divide the total by two.
If the number of observations is odd, the median is the value at that position.
If the number of observations is even, the median is the average of the two values on either side of the midpoint.
Consider the following example where there are 11 observations.
4, 6, 7, 7, 7, 8, 9, 9, 9, 10, 10
The midpoint would be the sixth number in this series, (11+1)/2 = 6.
Therefore the median is 8.
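Now consider the following example where there are 10 observations.
4, 6, 7, 7, 7, 8, 9, 9, 9, 10
The midpoint position is (10+1)/2 = 5.5, which falls between the fifth and sixth values. Therefore the median is (7+8)/2 = 7.5.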
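The position rule for the median can be written as a short function. This is a minimal sketch, assuming a non-empty list of numbers; the function name `median_by_position` is hypothetical and chosen for illustration. It is applied to the 11-value and 10-value lists above.

```python
def median_by_position(values):
    """Median found with the (n + 1) / 2 position rule."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2          # 1-based position of the midpoint
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that value.
        return ordered[int(position) - 1]
    # Even n: the position ends in .5, so average the two neighbours.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

odd_data = [4, 6, 7, 7, 7, 8, 9, 9, 9, 10, 10]
even_data = [4, 6, 7, 7, 7, 8, 9, 9, 9, 10]

median_odd = median_by_position(odd_data)
median_even = median_by_position(even_data)
print(median_odd, median_even)
```

Python's built-in `statistics.median` follows the same rule and could be used instead of a hand-written function.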
Mean
• The mean is the sum of all the values in the data divided by the number of observations.
\[ \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} \]
Consider the following variable:
4, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10
There are 20 observations and the sum of all these values is 158. Therefore, the mean is 7.9.
\[ \bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{158}{20} = 7.9 \]
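The same calculation can be sketched in Python by following the formula directly: add up the values and divide by the number of observations. The variable names are illustrative.

```python
# The 20 observations from the mean example.
values = [4, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10]

n = len(values)        # number of observations, N
total = sum(values)    # sum of all values
mean = total / n       # mean = sum divided by N

print(n, total, mean)
```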
Relative Advantages and Disadvantages of Mean, Median, and Mode
• The major consideration in selecting a measure for describing data is how well it represents the variable.
• If the variable has a symmetric distribution and doesn’t include extreme scores, the mean is a natural choice, and the mean, median, and mode will be very close to each other.
• The mean is convenient to work with because it is defined by an equation and can be manipulated algebraically. There is no comparable formula for the mode or median.
• Also, as we will see when we introduce the Central Limit Theorem later, the sample mean is typically a better representation of the population mean than the median or mode.
• One concern about using the mean to represent a variable is the existence of extreme scores. The mean is not resistant to extreme scores and can be misleading about the center of the distribution when they are present. In such cases the median could be a better choice.
• Consider the following example with no extreme scores. The mean is equal to 7.9 and the median is 8. Therefore, either can be used to represent the center of the variable, and both provide an accurate representation.
4, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10
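To see how an extreme score affects the two measures, the sketch below compares the mean and median of the data above with a modified copy. The modified dataset is a hypothetical illustration that is not in the original material: the last value, 10, is replaced with 100.

```python
from statistics import mean, median

# Original data with no extreme scores.
values = [4, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10]

# Hypothetical modification: replace the last value (10) with an
# extreme score of 100. This dataset is an assumption for illustration.
with_extreme = values[:-1] + [100]

mean_original = mean(values)
median_original = median(values)
mean_extreme = mean(with_extreme)
median_extreme = median(with_extreme)

print(mean_original, median_original)
print(mean_extreme, median_extreme)
```

A single extreme score pulls the mean well above most of the observations, while the median stays where it was.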
Variability Measures
• While the measures of central tendency reflect the general location of most of the values in a variable, they don’t reflect how spread out the values are.
• Variability measures reflect how much the values differ from each other.
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
4, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10
While the centers of the two variables are about the same, one has no variation while the other has some variation. Variability measures add one more dimension to the description of data.
• Some major statistics used to describe the variability:
Range
Interquartile range
Standard Deviation/Variance
Range
4, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10
Min = 4
Max = 10
Range = 10 – 4 = 6
Interquartile Range (IQR)
• The interquartile range is the distance between the first quartile and the third quartile.
1st quartile = the point that cuts off the lowest 25% of the data (Q1)
3rd quartile = the point that cuts off the upper 25% of the data (Q3)
• IQR = Q3 – Q1
• Note that we defined the median earlier as the 50th percentile, the point that cuts the data into two halves. It is also called the 2nd quartile (Q2).
[Figure: a number line marking Minimum, Q1 (First Quartile), Q2 (Second Quartile, Median), Q3 (Third Quartile), and Maximum. The range spans Minimum to Maximum; the interquartile range spans Q1 to Q3.]
Example:
62, 64, 66, 70, 72, 76, 77, 81, 81
Minimum = 62
Q1 = (64+66)/2 = 65
Q2 (Median) = 72
Q3 = (77+81)/2 = 79
Maximum = 81
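The quartiles in this example are the medians of the lower and upper halves of the sorted data, with the overall median left out of both halves when the number of observations is odd. The sketch below follows that convention; the function name `quartiles` is hypothetical. Other conventions exist (for example, interpolation-based percentiles in many software packages) and can give slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values below the median position
    upper = ordered[half + (n % 2):]   # values above it; skips the median if n is odd
    return median(lower), median(ordered), median(upper)

data = [62, 64, 66, 70, 72, 76, 77, 81, 81]

q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)     # Range = Maximum - Minimum
iqr = q3 - q1                          # IQR = Q3 - Q1

print(q1, q2, q3, data_range, iqr)
```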
Standard Deviation/Variance
Variance is the average squared deviation of each value in the data from the mean.
Population variance:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]
Sample variance:
\[ s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} \]
For the nine values 62, 64, 66, 70, 72, 76, 77, 81, 81, the mean is 72.1.
x_i    (x_i - 72.1)^2
62     102.0
64     65.6
66     37.2
70     4.4
72     0.0
76     15.2
77     24.0
81     79.2
81     79.2
Sum = 406.9
\[ s^2 = \frac{406.9}{9 - 1} = 50.9 \]
• The variance is a very important concept and a commonly used statistic for describing the variability in the data.
• Because the variance is expressed in squared units, people typically take its square root and report the standard deviation, which is easier to interpret.
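The variance and standard deviation calculations can be sketched step by step in Python. The example below uses the nine values from the worked variance example and keeps the mean unrounded, so intermediate values may differ slightly in later decimals from a hand calculation that rounds the mean to 72.1. The variable names are illustrative.

```python
from math import sqrt

# The nine observations from the worked variance example.
data = [62, 64, 66, 70, 72, 76, 77, 81, 81]

n = len(data)
mean = sum(data) / n

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]
sum_of_squares = sum(squared_deviations)

population_variance = sum_of_squares / n        # divide by N
sample_variance = sum_of_squares / (n - 1)      # divide by N - 1
sample_sd = sqrt(sample_variance)               # standard deviation

print(round(mean, 1), round(sum_of_squares, 1))
print(round(sample_variance, 1), round(sample_sd, 2))
```

The standard deviation is in the same units as the original values, which is what makes it easier to interpret than the variance.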
• You may sometimes see people refer to z-scores or T-scores.
• These are simply linear transformations of the observed values that allow better comparison across datasets with different dispersion.
• Suppose a student takes a math test and a reading test. The maximum possible score is 100 for both tests. The student receives a score of 50 on the math test and a score of 80 on the reading test.
In an absolute comparison, one can argue that the student did better on the reading test than on the math test.
However, the math test may have been more difficult than the reading test. Comparing the scores to those of the rest of the class can provide a better picture.
To do so, we need to know how well the others in the class performed on both tests.
Measures of Relative Standing
One can compute how far the student’s math score of 50 is from the rest of the class. Using the mean and standard deviation as summary measures for the rest of the class (here a mean of 40 and a standard deviation of 10), we find that the student’s score is 1 standard deviation above the class average. So, the student actually did very well in math compared to the rest of the class.
\[ \frac{50 - 40}{10} = 1 \]
One can also compute how far the student’s reading score of 80 is from the rest of the class. Using the mean and standard deviation as summary measures for the rest of the class (here a mean of 90 and a standard deviation of 5), we find that the student’s score is 2 standard deviations below the class average. So, the student actually didn’t do very well in reading compared to the rest of the class.
\[ \frac{80 - 90}{5} = -2 \]
So, we can compute the relative standing of the student compared to the rest of the class using the mean and standard deviation for the rest of the class.
You can convert raw data to z-scores by subtracting the mean from each observation and then dividing by the standard deviation.
A z-score tells us how far a value is from the mean in standard deviation units.
\[ z = \frac{x - \bar{x}}{s} \]
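The z-score formula can be sketched as a small function. The example below reproduces the math and reading comparisons using the class means and standard deviations from the equations above, and then converts a whole dataset to z-scores. The function names are hypothetical, and the nine-value dataset is reused from the variance example purely for illustration.

```python
from statistics import mean, stdev

def z_score(x, center, spread):
    """Distance of x from the mean, in standard deviation units."""
    return (x - center) / spread

# Math test: score 50, class mean 40, standard deviation 10.
math_z = z_score(50, 40, 10)
# Reading test: score 80, class mean 90, standard deviation 5.
reading_z = z_score(80, 90, 5)

def to_z_scores(values):
    """Convert raw data to z-scores using the sample mean and sample SD."""
    center = mean(values)
    spread = stdev(values)
    return [z_score(v, center, spread) for v in values]

# Illustration only: the nine values from the variance example.
data = [62, 64, 66, 70, 72, 76, 77, 81, 81]
z_values = to_z_scores(data)

print(math_z, reading_z)
print([round(z, 2) for z in z_values])
```

After the conversion, the z-scores have a mean of 0 and a standard deviation of 1, which is what makes values from different datasets comparable.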