Review Question - C2 - SACR3080
Notes:
Mode (Mo): The most frequently observed category or value in a dataset.
Median (Md): The value that divides an ordered dataset into two equal
halves. It is also the 50th percentile (P50), marking the point where 50% of the data
lies below and 50% above.
Mean (x̄): The average of all values in a dataset, calculated by summing
the values (∑xi) and dividing by the number of observations (N). Often referred to
as "x-bar": x̄ = ∑xi / N.
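The three definitions above can be checked on a small dataset. This is a minimal sketch using Python's standard `statistics` module; the seven values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from statistics import mean, median, multimode

# Hypothetical sample of seven observations (not taken from the course notes).
values = [2, 3, 3, 5, 7, 8, 14]

mo = multimode(values)  # list of the most frequent value(s)
md = median(values)     # middle value of the ordered data (P50)
x_bar = mean(values)    # sum of the values divided by N

print("Mode:", mo)
print("Median:", md)
print("Mean:", x_bar)
```

`multimode` returns a list, so a tie between categories shows up as more than one value instead of being silently hidden.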
Review Questions:
1. What rules of thumb are sometimes offered as to how levels of measurement should be
associated with measures of central tendency?
Nominal: Use the mode since there is no inherent order, and the mean and median cannot
be applied.
Ordinal: Use the median because the data has a rank or order, but the distances between
categories are not precise. The mode is also applicable but doesn't take advantage of
ordering.
Interval/Ratio: Use mode, median, or mean, as the data is ordered and the distances
between values are consistent.
It is acceptable to use a lower-level measure (like the median or mode) even with
interval or ratio data, depending on the goal of the analysis. For instance, the median
might be a better choice when the mean does not represent the central tendency well due
to outliers.
If we have interval or ratio data, and we have reason to point out the value which divides
our observations into upper and lower halves, the median may be preferable to the mean.
We may also prefer the median when it better represents the bulk of our cases.
3. What are three situations in which we might want to use a mode (a) and one in which it could
be misleading (b)?
(a) The mode is preferable when dealing with nominal or ordinal data, in the presence of skewed
distributions or outliers, and when identifying common categories or preferences is more
relevant than calculating an average.
Explanation:
Nominal: when dealing with nominal data (categories without a specific order), the mode
is the only measure of central tendency that can be used. It helps identify the most
frequently occurring category, such as the most popular product in a survey.
Non-Normal Distributions: in distributions that are skewed or have outliers, the mode
can provide a better representation of the most common value than the mean. This is
because the mode is not influenced by the magnitude of extreme values.
Ordinal Data: in ordinal datasets, where the values have a meaningful order but the
intervals between values are not consistent, the mode can effectively represent the most
common ranking or preference without assuming equal distances between categories.
Descriptive Statistics in Specific Fields: in certain fields such as marketing, sociology,
and psychology, identifying the most common response or behavior can be more valuable
than averaging all responses. For example, knowing the most frequently chosen option in
a consumer preference study can inform product development.
Robustness to Outliers: the mode remains unchanged despite the presence of extreme
values, making it a stable measure when the dataset contains outliers that distort the
mean.
Note on instability: the mode becomes unstable when two or more categories are
about equally common, because a small change in counts can move it from one category to another.
(b) The mode can be misleading when the most frequent value(s) is not representative of the
overall group.
Example: Bimodal or Multimodal Distributions:
In datasets with two or more modes, the presence of multiple values that occur with the
highest frequency can lead to confusion. For example, in a test score dataset where
students cluster around two different scores (e.g., 50 and 80), the mode could suggest that
both scores are equally representative of student performance, which may not reflect the
overall distribution effectively.
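The two-cluster situation described above can be sketched as follows. The scores are hypothetical; the point is only that reporting the mode(s) together with the share of cases they cover makes the limitation visible.

```python
from collections import Counter
from statistics import multimode

# Hypothetical test scores clustering around 50 and 80.
scores = [50, 50, 50, 52, 65, 78, 80, 80, 80, 85]

modes = multimode(scores)  # every value tied for the highest frequency
counts = Counter(scores)

# Proportion of all cases that each mode accounts for.
share = {m: counts[m] / len(scores) for m in modes}

print("Modes:", modes)
print("Share of cases per mode:", share)
```

Here each mode covers well under half of the students, so neither one describes the group as a whole.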
4. Define the median. What are three situations in which we might want to use it?
The Median (Md): the value that divides an ordered dataset into two equal
halves. It is also the 50th percentile (P50), marking the point where 50% of
the data lies below and 50% above.
The median is preferred for skewed distributions, datasets with outliers, and ordinal data.
It is preferred when it better represents the bulk of the cases.
Explanation:
Income and Wealth Distribution: in analyzing income or wealth data, the median is
often preferred because it is less affected by extreme values (outliers). For instance, in a
dataset where most individuals earn between $30,000 and $50,000 but a few earn
millions, the mean income might give a skewed impression of the average income. The
median, however, would provide a more accurate reflection of what a typical person
earns, as it represents the middle point of the distribution.
Ordinal Data: when working with ordinal data, where the values have a specific order
but the intervals between them are not consistent, the median is a suitable measure. For
example, in a survey asking respondents to rate satisfaction on a scale of 1 to 5, the
median can indicate the central tendency of responses without assuming equal differences
between the ratings.
Skewed Distributions: in datasets that are skewed (either positively or negatively), the
median provides a better measure of central tendency than the mean. For example, in a
dataset of test scores where a few students perform exceptionally well while most
perform at a lower level, the mean might be inflated by the high scores. The median will
give a more reliable representation of the typical test score among the majority of
students.
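The income case above can be illustrated with a few made-up figures. This is a sketch under the assumption of six hypothetical earners, five in the $30,000 to $50,000 range and one earning $2,000,000.

```python
from statistics import mean, median

# Hypothetical incomes: five typical earners and one extreme case.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 2_000_000]

mean_income = mean(incomes)      # pulled upward by the single large income
median_income = median(incomes)  # average of the two middle incomes

# How many earners fall below the mean?
below_mean = sum(x < mean_income for x in incomes)

print("Mean:", round(mean_income, 2))
print("Median:", median_income)
print("Earners below the mean:", below_mean, "of", len(incomes))
```

The mean ends up above five of the six incomes, while the median stays inside the range where most of the cases lie.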
5. What is one situation in which a median could be misleading? What might we do in such a
situation?
The median can be misleading when there are few cases around the center of the
distribution. To counter this, we can use techniques like the broadened median, nearby
percentiles, or the mid-mean, which incorporate more data for a more stable measure of
central tendency. If these options aren't viable, relying on the mode or accepting the
instability of the median may be necessary.
Explanation:
One situation in which the median could be misleading is when dealing with small samples
that have sparse data points near the center of the distribution. In such cases, minor changes
in the values of just a few observations can lead to significant shifts in the median. For example,
in a “bathtub” distribution where there are large gaps between the central cases, the median may
not reliably represent the central tendency if a few data points are added or adjusted.
What to Do in Such a Situation:
Broadened Median:
- Instead of relying solely on the median, calculate a "broadened median" by taking the average
of the median and one or two observations on either side. This approach incorporates more data
and can provide a more stable measure.
Use of Nearby Percentiles:
- Calculate the average of percentiles near the center, such as (P40 + P50 + P60) / 3. This
method utilizes additional data points surrounding the median, leading to a more stable estimate.
Mid-Mean:
- Compute the "mid-mean," which is the mean of all observations in the central half of the
distribution. This approach takes into account more data points than the median alone, which
helps to mitigate the instability.
Consider the Mode:
- If the median remains unstable and the data allows for it, consider using the mode as an
alternative measure of central tendency, especially if the data is categorical or ordinal.
Assess Sample Size:
- Whenever possible, increase the sample size to minimize the impact of sparse data points,
making the median a more reliable measure in the first place.
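The first three alternatives above can be sketched in a few lines. The function names, the "bathtub" sample, and the rounding and interpolation choices are all assumptions made for illustration; textbooks differ on these details.

```python
from statistics import mean, median, quantiles

def broadened_median(data, k=1):
    """Mean of the median and k observations on either side of it.

    Assumption: with an even number of cases, the two middle
    observations stand in for the median.
    """
    xs = sorted(data)
    n = len(xs)
    lo = (n - 1) // 2 - k
    hi = n // 2 + k
    return mean(xs[lo:hi + 1])

def percentile_average(data):
    """(P40 + P50 + P60) / 3, using inclusive interpolation (an assumption)."""
    deciles = quantiles(data, n=10, method="inclusive")  # P10 through P90
    p40, p50, p60 = deciles[3], deciles[4], deciles[5]
    return (p40 + p50 + p60) / 3

def mid_mean(data):
    """Mean of the central half of the ordered cases.

    Assumption: the number of cases dropped from each end is rounded down.
    """
    xs = sorted(data)
    k = len(xs) // 4
    return mean(xs[k:len(xs) - k])

# Hypothetical "bathtub" sample: many cases at the ends, few in the middle.
sample = [1, 2, 2, 3, 9, 17, 18, 18, 19]

plain = median(sample)
broadened = broadened_median(sample)
pct_avg = percentile_average(sample)
mid = mid_mean(sample)

print(plain, round(broadened, 2), round(pct_avg, 2), round(mid, 2))
```

Each alternative draws on several central observations, so moving the single middle case changes it less than it changes the plain median.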
6. What is one situation in which we might want to use a mean, and one situation in which it
could be misleading?
The mean is useful for symmetric datasets without outliers. It can be misleading
(unstable) in the presence of outlying cases, which, if extreme enough, can shift it greatly.
Income distributions are a common example; there, alternatives like the median or trimmed
mean provide a more accurate measure of central tendency.
Explanation:
One situation where we might want to use the mean is when dealing with a dataset that is
symmetrically distributed without outliers. For example, in a study of test scores where most
students score between 70 and 90, the mean would accurately reflect the average performance of
the entire group, providing a useful measure of central tendency.
The mean can be misleading in the presence of outliers, such as in income distributions. For
instance, if a few individuals earn significantly higher incomes compared to the rest of the
population, the mean income may be inflated and not accurately represent the typical income of
the majority. In such cases, alternatives like the median or trimmed mean are preferred, as they
provide a more accurate reflection of the central tendency without being skewed by extreme
values.
7. Suppose that we have ratio data, e.g., earned income, but the sample is small and there is
reason to think that the mean is unstable. What are some alternatives?
When the mean is unstable in small samples of ratio data like earned income, alternatives
such as trimmed means, the median, mid-mean, Winsorized mean, percentile measures,
and weighted means can be utilized to provide a more reliable and representative measure
of central tendency. These methods are particularly effective in handling the influence of
outliers.
Explanation:
Trimmed Mean:
- In a trimmed mean, a specified percentage (say, p%) of the highest and lowest observations is
removed before calculating the mean. This approach helps mitigate the impact of outliers while
retaining more data than the median. Commonly, between 5% and 10% is trimmed from each
end.
Median:
- The median is a measure of central tendency that divides the dataset into two equal halves. It
is largely unaffected by outliers, making it a preferred measure in income data where extreme values
can distort the mean. The median provides a clear interpretation of the typical income level.
Mid-Mean:
- The mid-mean calculates the average of all observations in the central half of the distribution.
This method reduces the influence of extreme values by focusing on data that are closer to the
median, providing a more stable average.
Winsorized Mean:
- Winsorizing involves replacing the extreme values in the dataset with values closer to the
center (e.g., setting the highest 1% of values to the largest remaining value and the lowest 1% to
the smallest remaining value). This method helps reduce the effect of outliers while still providing an average.
Percentile Measures:
- Using percentile values, such as the 25th (P25), 50th (P50, or median), and 75th (P75), can
give a clearer picture of the income distribution without being affected by outliers. This approach
helps to understand the spread and typical earnings.
Weighted Mean:
- Assigning lower weights to cases that are further away from the central part of the
distribution can also help create a more stable average. This method allows the mean to be
adjusted based on the distribution's shape.
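Two of the alternatives above, the trimmed mean and the Winsorized mean, can be sketched as follows. The helper functions and the income figures (in thousands of dollars) are hypothetical, and rounding the number of affected cases down is an assumption.

```python
from statistics import mean, median

def trimmed_mean(data, proportion):
    """Mean after removing `proportion` of the cases from each end."""
    xs = sorted(data)
    k = int(len(xs) * proportion)  # cases cut from each end (rounded down)
    return mean(xs[k:len(xs) - k])

def winsorized_mean(data, proportion):
    """Mean after pulling `proportion` of the cases at each end inward."""
    xs = sorted(data)
    k = int(len(xs) * proportion)  # cases replaced at each end (rounded down)
    if k:
        xs[:k] = [xs[k]] * k        # lowest k take the smallest remaining value
        xs[-k:] = [xs[-k - 1]] * k  # highest k take the largest remaining value
    return mean(xs)

# Hypothetical small sample of earned incomes, in thousands of dollars.
incomes = [18, 30, 32, 35, 38, 40, 42, 45, 50, 900]

raw = mean(incomes)
trimmed = trimmed_mean(incomes, 0.10)
winsorized = winsorized_mean(incomes, 0.10)
middle = median(incomes)

print(raw, trimmed, winsorized, middle)
```

With 10% handled at each end, both adjusted means land close to the median, while the raw mean is dominated by the single very large income.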
8. Why is the sample mean often replaced by, or supplemented by, the median for income
distributions?
The sample mean is often replaced or supplemented by the median for income
distributions because the median is largely unaffected by outliers, is easy to interpret, and
provides a clearer picture of typical income levels.
9. For skewed distributions, what often happens as we trim more cases from the ends of the
distribution?
For skewed distributions, as we trim more cases from the ends, trimmed means tend to
get closer to the median. This means that when we remove extreme values from either
side of the distribution, the average of the remaining values (the trimmed mean) shifts
closer to the middle value (the median).
Explanation:
By cutting out the outliers—whether they are very high or very low incomes—we get a clearer
picture of what most people earn. As we keep trimming away these extreme cases, we find that
the average (trimmed mean) aligns more closely with the median, which better reflects the
typical income. This process helps to smooth out the influence of unusual cases and provides a
more accurate understanding of the data.
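The pattern described above can be traced on a small right-skewed sample. The eleven incomes (in thousands of dollars) are hypothetical; the sketch trims one more case from each end at every step and records how far the trimmed mean is from the median.

```python
from statistics import mean, median

# Hypothetical right-skewed incomes, in thousands of dollars.
incomes = [20, 25, 28, 30, 32, 35, 40, 50, 80, 160, 600]

xs = sorted(incomes)
md = median(xs)

# Trim k cases from each end, for k = 0 (plain mean) up to k = 5.
trimmed = [mean(xs[k:len(xs) - k]) for k in range(6)]

# Distance between each trimmed mean and the median.
gaps = [abs(t - md) for t in trimmed]

for k, (t, g) in enumerate(zip(trimmed, gaps)):
    print(f"k={k}: trimmed mean={t:.2f}, gap to median={g:.2f}")
```

In this sample the gap shrinks at every step, and once only the middle case is left the trimmed mean is the median itself.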
10. In what sense are the mode, median, and mean averages?
The mode is the most common value in a dataset, indicating frequency, but it can be
misleading in datasets with many categories or when it represents only a small portion of
the data; thus, it's important to include the number or proportion of cases it reflects. The
median, as a "positional average," lies in the middle of an ordered dataset, dividing it into
two equal halves and minimizing the sum of absolute deviations from it. The mean is
calculated by summing all values and dividing by the number of observations, balancing
the data so that the total deviations from it equal zero, thus providing a central point that
minimizes overall error and squared deviations.
Explanation:
Mode: the mode is the most common or typical value in a dataset. It shows which value occurs
most frequently. However, it can sometimes be misleading, especially in datasets with many
categories or when the mode accounts for only a small portion of the data. When reporting the
mode, it’s important to include the number or proportion of cases it represents.
Median: the median is a “positional average” because it is located in the middle of an ordered
dataset. It divides the data into two equal halves. An important feature of the median is that it
minimizes the sum of absolute deviations from it, meaning that no other value can achieve a
lower total distance when you look at how far each data point is from the median.
Mean: the mean is calculated by adding all the values in a dataset and dividing by the number of
observations. It is considered an average because it balances the data by ensuring that the total of
the deviations from the mean (both above and below) equals zero. This makes it a central point
in the sense that its overestimates and underestimates of the individual values cancel out. Additionally, the mean
minimizes the sum of squared deviations, meaning it is the best measure to use when we want to
reduce the overall error in a dataset.
11. In what sense is the median the point closest to the observed data?
The median is considered the point closest to the observed data because it minimizes the
sum of absolute deviations. This means that the total distance of all data points from the
median is no greater than their total distance from any other value, making it the best
representation of the center of the dataset.
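This property can be checked numerically. The sketch below uses five hypothetical values and a simple grid of candidate centers (an assumption made for illustration, not a proof) and picks the candidate with the smallest total absolute deviation.

```python
from statistics import mean, median

# Hypothetical data with one large value.
data = [2, 3, 5, 9, 21]

def total_abs_dev(c):
    """Sum of absolute deviations of the data from a candidate center c."""
    return sum(abs(x - c) for x in data)

# Candidate centers from 0.0 to 25.0 in steps of 0.1.
candidates = [c / 10 for c in range(0, 251)]
best = min(candidates, key=total_abs_dev)

print("Best candidate:", best)
print("Median:", median(data))
print("Total distance from median:", total_abs_dev(median(data)))
print("Total distance from mean:", total_abs_dev(mean(data)))
```

The winning candidate coincides with the median, and the total distance from the mean is larger.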
12. In what sense does the mean lie in the center of a distribution?
The mean lies at the center of a distribution by balancing the values on either side. It is
the point where the sum of positive deviations equals, in magnitude, the sum of negative deviations,
meaning that its overestimates and underestimates cancel out. In skewed distributions,
the mean often shifts toward the tail, indicating that it reflects the influence of extreme
values.
13. What is another technical merit of the mean as a measure of central tendency?
Another technical merit of the mean is that it minimizes the sum of squared deviations
from it. This property makes the mean particularly useful for statistical analyses, as it
captures the central tendency of the data effectively and helps reduce overall error in the
dataset.
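Both technical properties of the mean, deviations that sum to zero and a minimal sum of squared deviations, can be checked on a small example. The five values and the grid of candidate centers are hypothetical and serve only as an illustration.

```python
from statistics import mean

# Hypothetical data with one large value.
data = [2, 3, 5, 9, 21]

x_bar = mean(data)

# Deviations above and below the mean cancel out.
deviations = [x - x_bar for x in data]

def total_sq_dev(c):
    """Sum of squared deviations of the data from a candidate center c."""
    return sum((x - c) ** 2 for x in data)

# Candidate centers from 0.0 to 25.0 in steps of 0.1.
candidates = [c / 10 for c in range(0, 251)]
best = min(candidates, key=total_sq_dev)

print("Mean:", x_bar)
print("Sum of deviations:", sum(deviations))
print("Best candidate:", best)
```

The candidate with the smallest sum of squared deviations coincides with the mean.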
14. In relation to one another, where do the mode, median, and mean lie if we have a
continuous, single-peaked, and skewed distribution?
In a continuous, single-peaked, and skewed distribution, the mode is located at the peak
of the distribution, the median lies in the middle, and the mean is pulled toward the tail.
In other words, for single-peaked, skewed, and continuous distributions, the mean
will lie farthest into the long tail, followed by the median, and the mode will lie in neither
tail.
Generally, the ordering is mode < median < mean for right-skewed distributions and
mean < median < mode for left-skewed distributions, with the mean being the furthest into
the tail. This relationship can vary in other types of distributions, highlighting the
importance of context when analyzing these measures.
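The usual ordering can be seen in a small example. The ten values are hypothetical, and a small discrete sample is only a stand-in for the continuous, single-peaked case the question describes.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample: most values are small, one is large.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 8, 20]

mo, md, x_bar = mode(right_skewed), median(right_skewed), mean(right_skewed)
print("Right-skewed:", mo, md, x_bar)

# Mirroring the sample gives a left-skewed one, and the ordering reverses.
left_skewed = [-x for x in right_skewed]

mo_l, md_l, x_bar_l = mode(left_skewed), median(left_skewed), mean(left_skewed)
print("Left-skewed:", mo_l, md_l, x_bar_l)
```

In both samples the mean sits farthest into the long tail and the median falls between the mean and the mode.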