SSGC 8802 - Descriptive Statistics
SSGC 8802 QUANTITATIVE RESEARCH METHODS
SHU FAI CHEUNG
FACULTY OF SOCIAL SCIENCES
UNIVERSITY OF MACAU
2024-2025 1ST SEMESTER
Categorical
◦ Also known as nominal or qualitative. If there are only two possible states, also called
binary
Ordinal
◦ Also known as rank-order
Interval
◦ Also known as equal-interval
Ratio
◦ Called "scale" in some programs but ratio is a less ambiguous term.
SF Cheung 2024-09
SSGC 8802 Levels of Measurement
Categorical “variables”
“Name”: Also called nominal variables
E.g., faculty, ethnicity, the shape of a person’s face, hair style, etc.
Besides denoting different states, the values have no meaningful
order.
We cannot say meaningfully that Category A is
greater/less/fewer/smaller/etc. than Category B.
If a nominal variable has only two possible values (e.g., true/false),
then it is called a binary variable or dichotomous variable.
Ordinal “variables”
Similar to nominal variables, but the states (categories) can
be ordered meaningfully. The order of the categories
corresponds to the order of the attribute being measured.
However, we do not assert that the difference between two
adjacent states is constant.
E.g., the 1st is better than the 2nd, and the 2nd better than
the 3rd. However, the difference between the 1st and the 2nd
may or may not be equal to the difference between the 2nd
and the 3rd.
Interval “variables”
We assert that equal (numeric) difference
between two values implies equal difference in
the underlying attribute being measured.
E.g.,
◦(8 meters – 4 meters) equals (7 meters – 3 meters) equals
(12.573 meters – 8.573 meters) in terms of the length
being measured.
Ratio “variables”
Like interval variables, equal difference between two values implies
equal difference in the underlying attribute being measured.
However, for ratio scale, the “zero” is “meaningful” (e.g., means
“nothing,” or the absence of the attribute being measured) and we
call this zero the absolute zero.
Consequently, the ratio of two values is also meaningful:
it represents the ratio on the attribute being measured. Hence the
name ratio variable.
E.g., height, the number of children, the number of courses you have
taken in this semester.
Why These Levels?
Why levels of measurement? Because the level of measurement
may suggest the appropriate statistical procedure to analyze the data.
◦ Note that I only say “suggest” above.
Levels of “Measurement” or
Levels of “Variable”?
Unfortunately, the use of these levels can sometimes be confusing.
When deciding what methods to use on the data, we often
talk about attributes of a column such as “categorical,” “numerical,”
“quantitative,” “qualitative,” etc., and then think that the labels (levels?)
we assign “determine” the methods we can or should use.
Think about these examples:
- How frequently a person performed a behavior (e.g., playing video games)
during the last seven days. How would we measure this attribute?
- Whether a person is an FSS student or a non-FSS student. “Can” we
compute the “mean” of this variable? Can we talk about the “increase” or
“decrease” of this variable?
Suggested Readings for Levels of Measurement
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103,
677–680.
◦ This is one of the classic papers on the four levels of measurement.
Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio
typologies are misleading. The American Statistician, 47, 65–72.
doi:10.2307/2684788
◦ The classification scheme is not without its criticisms. This paper highlights the potential
problems of this scheme from a statistical point of view.
Data Analysis: Describe (Explore) First!
After we have collected a set of data (a collection of values on a variable),
whatever the type of variable, we usually want to describe the
data set.
One of the most straightforward questions is: “How many of them
have this value?”
E.g.,
◦ Gender: How many females? How many males?
◦ Grade: How many have an A? B? C? …
◦ Age: How many are 17 years old? How many are 18? …
Frequency Distribution: Frequency Table
Sample data
◦ 6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5
Ordered
◦ 1, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 8
Grouped by value:
Value   Frequency   Percent (%)
1       1           5%
2       0           0%
3       6           30%
4       4           20%
5       4           20%
6       2           10%
7       2           10%
8       1           5%
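The counting behind this table can be sketched in Python. This is only an illustrative sketch using the standard library; the variable names (`data`, `freq_table`) are assumptions and do not come from the slides.

```python
from collections import Counter

# Sample data copied from the slide.
data = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]

counts = Counter(data)
n = len(data)

# Include every value from the minimum to the maximum so that values
# with no cases (here, 2) still appear with a frequency of 0.
freq_table = {
    value: (counts.get(value, 0), 100 * counts.get(value, 0) / n)
    for value in range(min(data), max(data) + 1)
}

print(f"{'Value':>5} {'Frequency':>9} {'Percent':>8}")
for value, (freq, pct) in freq_table.items():
    print(f"{value:>5} {freq:>9} {pct:>7.0f}%")
```

Listing the empty value 2 explicitly is a design choice: it keeps gaps in the distribution visible rather than silently skipping them.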
Grouped Frequency Distribution
Sample data
◦ 6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5
Value   Frequency   Percent (%)
1-2     1           5%
3-4     10          50%
5-6     6           30%
7-8     3           15%
Useful when we have data like this:
4.52, 4.93, 4.72, 5.76, 5.96, 6.21, 4.93, 4.88, 3.04, 4.15, …
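A grouped table like the one above can be produced by counting the cases that fall in each interval. The sketch below assumes groups of width 2 starting at 1, matching the slide; `bin_width`, `lowest`, and `grouped` are illustrative names only.

```python
# Sample data copied from the slide.
data = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]

bin_width = 2   # assumed: groups 1-2, 3-4, 5-6, 7-8 as on the slide
lowest = 1      # assumed lower limit of the first group

grouped = {}
for lower in range(lowest, max(data) + 1, bin_width):
    upper = lower + bin_width - 1
    # Count the cases whose value lies in [lower, upper].
    freq = sum(lower <= x <= upper for x in data)
    grouped[f"{lower}-{upper}"] = (freq, 100 * freq / len(data))

for label, (freq, pct) in grouped.items():
    print(f"{label:>5} {freq:>9} {pct:>7.0f}%")
```

For non-integer data such as 4.52, 4.93, and so on, the group limits would have to be defined differently (e.g., 4.0 up to but not including 4.5); that choice is left open here.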
Frequency Distribution: Histogram
• Grouped frequency table
◦ No space between bars
◦ Area represents frequency and proportion
Histogram: The Effect of Grouping
1) Central tendency
• Is there one point or value that can “represent” the data?
• For a numeric variable, where is the “center”?
2) Variation (a.k.a. dispersion)
• How much do the values vary?
Central Tendency: (Arithmetic) Mean
• Denote the number of data points, i.e., sample size, by N. The mean is the sum of all values divided by N:
$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$
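As a quick numerical illustration of the arithmetic mean, here is a minimal Python sketch using the 20 sample values from the frequency-table slide. The variable names are illustrative assumptions.

```python
from statistics import fmean

# Sample data copied from the frequency-table slide.
data = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]

N = len(data)                 # sample size
mean_manual = sum(data) / N   # sum of all values divided by N
mean_stdlib = fmean(data)     # the same quantity from the standard library

print(N, mean_manual, mean_stdlib)
```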
Central Tendency: Median
We want to find the "middle" value to represent the scores.
Odd number of cases
◦ The middle case after sorting, i.e., the ((N+1)/2)th case, is the median.
Note: There are other ways to find the median but this is usually not a major
concern if a variable has many different values and the sample size is large.
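The odd-N rule can be sketched in Python as follows. The helper `median_odd` is a hypothetical name written for this illustration. For an even number of cases, the standard library's `statistics.median` averages the two middle values, which is one common convention and not necessarily the one intended in the slides.

```python
from statistics import median

def median_odd(values):
    """Return the ((N+1)/2)th case of the sorted values (odd N only)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 0:
        raise ValueError("this sketch covers only an odd number of cases")
    # (n + 1) // 2 is a 1-based position; subtract 1 for Python's 0-based index.
    return ordered[(n + 1) // 2 - 1]

odd_sample = [3, 8, 1, 5, 4]       # hypothetical data, N = 5
print(median_odd(odd_sample))      # the 3rd case of 1, 3, 4, 5, 8
print(median(odd_sample))          # standard-library result for comparison

# Even N (the 20 sample values from the frequency-table slide):
even_sample = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]
print(median(even_sample))         # average of the 10th and 11th sorted cases
```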
Central Tendency: Mode
We want to find the value with the largest number of cases to represent the
data.
A dataset may have more than one mode.
What if there are many unique values, each with only one or a few cases?
◦ E.g., the intelligence scores (from 0 to 100) of a group of persons.
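The three situations mentioned above (one mode, more than one mode, and many unique values) can be illustrated with `statistics.multimode` from the Python standard library (Python 3.8 or later). All data sets except the first are hypothetical.

```python
from statistics import multimode

# Sample data copied from the frequency-table slide: a single mode.
data = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]
print(multimode(data))

# Hypothetical data with two modes.
two_modes = [1, 1, 2, 2, 3]
print(multimode(two_modes))

# Hypothetical scores that are all unique: every value is tied as "the mode",
# so the mode tells us little about the center.
all_unique = [55, 61, 72, 88]
print(multimode(all_unique))
```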
Central Tendency: Evaluation
Mean:
◦ Good: All data points can influence the mean, so in this sense the mean represents all of them.
◦ Bad: An extreme value (outlier) can have a large impact on the mean.
Median:
◦ Good: Robust, i.e., insensitive to extreme values (outliers)
◦ Bad: Not every kind of change in a single data point is reflected.
Mode:
◦ Good: Robust, i.e., insensitive to a small number of unusual cases
◦ Bad: Many kinds of changes are NOT reflected.
Note: The term robust has different meanings and can be used to mean
totally different things. Pay attention to the context.
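The contrast between the three measures can be checked numerically. In this sketch, one value of the sample data from the frequency-table slide is replaced by a hypothetical extreme value (8 becomes 80, as if it were a recording error); the names are illustrative.

```python
from statistics import fmean, median, mode

# Sample data copied from the frequency-table slide.
data = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]

# Hypothetical outlier: the single 8 is replaced by 80.
with_outlier = [80 if x == 8 else x for x in data]

summary = {
    "original": (fmean(data), median(data), mode(data)),
    "with outlier": (fmean(with_outlier), median(with_outlier), mode(with_outlier)),
}

for label, (mean_, median_, mode_) in summary.items():
    print(f"{label:>12}: mean={mean_:.2f}, median={median_}, mode={mode_}")
```

Only the mean moves; the median and the mode are unchanged, which is what "robust" means on this slide.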
Central Tendency: Other Measures
In addition to the three measures presented, there are many other
possible measures of central tendency. E.g.,
◦ Trimmed mean: A 10% trimmed mean is an arithmetic mean with the bottom and top
10% of cases removed.
◦ Winsorized mean: Replace values more extreme than predefined cutoffs (lower than
the lower cutoff or higher than the upper cutoff) by these cutoffs and then compute
the arithmetic mean.
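Both measures can be sketched in a few lines of Python. The functions `trimmed_mean` and `winsorized_mean` are hypothetical helpers written for this illustration, and the cutoffs 3 and 7 are arbitrary assumptions. Rounding the number of trimmed cases down is also an assumption, because software packages differ on this detail.

```python
from statistics import fmean

def trimmed_mean(values, proportion):
    """Mean after removing `proportion` of the cases from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # cases dropped at each end (rounded down)
    return fmean(ordered[k:len(ordered) - k])

def winsorized_mean(values, lower, upper):
    """Mean after replacing values beyond the cutoffs by the cutoffs."""
    clipped = [min(max(x, lower), upper) for x in values]
    return fmean(clipped)

# Sample data copied from the frequency-table slide.
data = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]

print(fmean(data))                  # ordinary arithmetic mean
print(trimmed_mean(data, 0.10))     # drops the 2 lowest and 2 highest cases
print(winsorized_mean(data, 3, 7))  # 1 becomes 3, 8 becomes 7
```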
Variation: Variance
With a numeric variable, if we decide to use the (arithmetic) mean as a
measure of the center, for each case, we can ask these questions:
◦ “How different is a case from the mean?” or “How far away is a case from the
mean?”
◦ “Is a case below, equal to, or above the mean?”
Variation: Variance (cont’d)
Taking an average of the deviations (the mean deviation) is certainly not a good solution because
◦ High and low scores will cancel out, e.g., (+5) + (-5) = 0.
◦ But +5 and -5 are equally badly represented by the mean (or equally deviated from the mean in
magnitude)!
Therefore, we square each deviation before computing the average.
◦ Compute the deviation from the mean ($\bar{X}$): $X_i - \bar{X}$
◦ Square the deviation from the mean: $(X_i - \bar{X})^2$
◦ Compute the average, we then have the variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}$
$\sigma^2$ is usually used to denote variance.
• IMPORTANT:
◦ This is a descriptive statistic to describe a finite set of values. As
you may have learned, when we want to estimate the population variance from a sample, $N - 1$ is used as the denominator instead: $\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$
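The steps for computing the variance can be traced numerically. This is a minimal sketch using the sample data from the frequency-table slide; the variable names are illustrative. For contrast, it also shows the version with N − 1 in the denominator, which is what `statistics.variance` computes.

```python
from statistics import fmean, pvariance, variance

# Sample data copied from the frequency-table slide.
data = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]
N = len(data)
mean = fmean(data)

deviations = [x - mean for x in data]          # step 1: deviation from the mean
mean_deviation = sum(deviations) / N           # signed deviations cancel out
squared = [d ** 2 for d in deviations]         # step 2: square each deviation
variance_n = sum(squared) / N                  # step 3: average (divide by N)
variance_n_minus_1 = sum(squared) / (N - 1)    # alternative: divide by N - 1

print(mean_deviation)                          # zero, up to rounding error
print(variance_n, pvariance(data))             # descriptive variance
print(variance_n_minus_1, variance(data))      # N - 1 version
```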
Variation: Standard Deviation
However, variance is not in the original unit.
◦ If a variable is in kg, the mean is in kg, but the variance is in kg².
◦ The standard deviation (SD), the square root of the variance, is back in the original unit (kg).
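Taking the square root of the variance gives the standard deviation, which is in the original unit again. The sketch below uses the sample data from the frequency-table slide and the descriptive (divide-by-N) variance; the variable names are illustrative.

```python
from math import sqrt
from statistics import pstdev, pvariance

# Sample data copied from the frequency-table slide.
data = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]

variance_n = pvariance(data)   # in squared units
sd = sqrt(variance_n)          # back in the original unit

print(variance_n, sd, pstdev(data))
```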
Variation: Evaluation
Variance and SD:
◦ Good: Like the mean, all data points can influence the variance and standard
deviation.
◦ Bad: Like the mean, the variance and standard deviation are sensitive to outliers.
Two More Aspects (1): Symmetry
A distribution can be asymmetric. We use skewness to measure this. Skewness = 0 for a
symmetric distribution.
Which direction does the “tail” point to? The negative end or the positive end?
Kurtosis: low (light tail) vs. high (heavy tail)
• Conceptual definition for a finite sample:
$\text{Raw Kurtosis} = \frac{\frac{1}{N} \sum_{i=1}^{N} (X_i - M)^4}{SD^4}$
• Estimate of population kurtosis is complicated and not presented here.
• Raw kurtosis is 3 for a “normal” distribution (to be presented).
• Therefore, excess kurtosis (raw kurtosis – 3) is usually used, which is zero for a normal distribution.
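The conceptual definition of raw kurtosis translates directly into code. In the sketch below, `raw_kurtosis` follows the finite-sample definition above, with SD computed using N in the denominator. The `skewness` function uses the third standardized moment, which is a common definition assumed here and not taken from the slides. Both function names and both data sets are hypothetical.

```python
from statistics import fmean, pstdev

def raw_kurtosis(values):
    """Average fourth-power deviation from the mean, divided by SD to the 4th power."""
    m = fmean(values)
    sd = pstdev(values)   # SD with N in the denominator
    fourth_moment = sum((x - m) ** 4 for x in values) / len(values)
    return fourth_moment / sd ** 4

def skewness(values):
    """Third standardized moment (assumed definition, not from the slides)."""
    m = fmean(values)
    sd = pstdev(values)
    third_moment = sum((x - m) ** 3 for x in values) / len(values)
    return third_moment / sd ** 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric data
right_skewed = [1, 1, 1, 2, 10]    # hypothetical data with a tail toward the positive end

print(skewness(symmetric), raw_kurtosis(symmetric), raw_kurtosis(symmetric) - 3)
print(skewness(right_skewed))
```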
A Commonly Used Distribution: Normal Distribution
• Then what is the normal
distribution?
◦ It is a family of theoretical
distributions that are used in
testing and estimation frequently.
• Non-technically speaking,
it is bell-shaped because …
◦ the most likely value is the mean
(highest frequency), and
◦ the farther away from the mean, the less
likely a value is (lower frequency).
A Commonly Used Distribution: Normal Distribution (cont’d)
• The normal distribution is a theoretical distribution because it is a mathematical
model. It almost never occurs in real data. However, it may be a good approximation.
• It is a family of distributions because there is no single “the normal distribution”.
Changing the unit will "change the distribution", i.e., its absolute position on a
2D plane.
Normal Distribution and Standardized Scores (Z scores)
• Definition
$Z = \frac{X - M}{SD}$
• X: The score to be transformed.
• M: Mean
• SD: Standard Deviation
➢Z score is useful to give a meaning to a score, especially when the unit
does not have a concrete meaning, which is not unusual in social
sciences.
➢E.g., for a scale of 1 to 10 on happiness, what does it mean for you
to have a score of 6? Is it high? Low? How high or how low?
➢Think about the following z scores. What do they mean?
➢0, 1, -1, 2, -2
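To make the definition concrete, here is a minimal Python sketch. The function `z_score` is a hypothetical helper. The 20 sample values from the frequency-table slide are treated, purely as an assumption for illustration, as happiness ratings on a 1-to-10 scale.

```python
from statistics import fmean, pstdev

def z_score(x, m, sd):
    """Distance of x from the mean m, expressed in SD units."""
    return (x - m) / sd

# Sample values from the frequency-table slide, treated here as
# hypothetical happiness ratings.
scores = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]
m = fmean(scores)
sd = pstdev(scores)   # SD with N in the denominator

z_for_6 = z_score(6, m, sd)
print(f"M = {m:.2f}, SD = {sd:.2f}, z for a score of 6 = {z_for_6:.2f}")
```

In this hypothetical group, a score of 6 is a little less than one SD above the mean, which is more informative than the raw number alone.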
Normal Distribution, Z scores, and
THE standard normal distribution
➢A standard normal distribution is
a normal distribution with mean =
0 and SD = 1.
➢A normal distribution of any
variable in any unit, if
transformed to z scores, will have a
standard normal distribution.
➢Why? Try to answer these two
questions:
➢What is the mean of a set of z
scores?
➢What is the SD of a set of z
scores?
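One way to explore these two questions is to check them numerically. The sketch below converts the sample values from the frequency-table slide to z scores and then computes the mean and SD of the z scores. The variable names are illustrative, and SD is computed with N in the denominator throughout.

```python
from statistics import fmean, pstdev

# Sample data copied from the frequency-table slide.
data = [6, 7, 4, 5, 3, 8, 3, 5, 4, 3, 4, 5, 3, 7, 1, 6, 3, 4, 3, 5]
m, sd = fmean(data), pstdev(data)

# Transform every value to a z score.
z = [(x - m) / sd for x in data]

mean_z = fmean(z)
sd_z = pstdev(z)
print(round(mean_z, 10), round(sd_z, 10))
```

Note that the data here are not normally distributed. The transformation changes the mean and SD, not the shape of the distribution.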
Normal Distribution, Z scores, and
THE standard normal distribution (cont’d)
➢If we believe that a distribution is close to a normal distribution, we
know that if converted to z scores, the distribution is close to a standard
normal distribution.
➢Then we know approximately the percentage of cases in any
range of the distribution.
[Figure: standard normal curve cut at z = -2, -1, +1, and +2; the five regions contain about 2%, 14%, 68%, 14%, and 2% of cases.]
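The approximate percentages (about 2%, 14%, 68%, 14%, and 2%) can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The cut points at z = -2, -1, +1, and +2 are assumed here because they are the ones that match these percentages.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)   # THE standard normal distribution
cdf = std_normal.cdf                     # proportion of cases at or below z

percentages = {
    "below -2": 100 * cdf(-2),
    "-2 to -1": 100 * (cdf(-1) - cdf(-2)),
    "-1 to +1": 100 * (cdf(1) - cdf(-1)),
    "+1 to +2": 100 * (cdf(2) - cdf(1)),
    "above +2": 100 * (1 - cdf(2)),
}

for label, pct in percentages.items():
    print(f"{label:>9}: about {pct:.0f}%")
```

The same approach gives the percentage for any other range, e.g., `cdf(1.5) - cdf(0.5)`.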
Descriptive Statistics: Final Remarks
There are many ways to describe a dataset.
Nevertheless, what is covered here are the basic ways that:
◦ Give you an idea of the possible pros and cons of various methods.
◦ Are usually the basic statistics that you find in published papers, and
those that you will report in your papers.