Data Skewness and Exercises (1)
1. Unimodal Distribution
A dataset is Unimodal if it has only one mode—one value that appears more
frequently than any other.
Dataset: [1, 3, 4, 5, 1, 1, 3, 4]
Takeaway: In a unimodal dataset a single value stands out as the most frequent
(here 1, which appears three times), so the mode gives an unambiguous summary of
the most common outcome.
2. Bimodal Distribution
A dataset is Bimodal if it has two distinct modes—two different values appear with
the highest frequency.
• Example: Test scores from a mixed-ability class—one group scoring high and
another scoring low.
Dataset: [2, 3, 4, 5, 3, 2, 6, 7]
✔ There are two dominant peaks: two values tie for the highest frequency (here, 2
and 3 each appear twice).
✔ The dataset may have two distinct groups or subpopulations—for example, test
scores in a mixed-ability class where one group performs well and another
struggles.
✔ It suggests variability in the data—unlike unimodal distributions, bimodal data
may not follow a single normal distribution and might require separate analysis for
each mode.
✔ For categorical data, it indicates two popular choices—for example, in customer
preferences, two different product colours might be equally favoured.
✔ For numerical data, decision-making should consider both modes—since
treating the dataset as a single unit might lead to misleading conclusions.
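The mode-counting idea behind these definitions can be sketched in Python. The helper name classify_modality, and the rule that a dataset with no repeated value is reported as having no mode, are assumptions made for this illustration rather than part of the notes.

```python
from collections import Counter


def classify_modality(data):
    """Return (label, modes) for a non-empty list of values.

    Assumption for this sketch: if no value repeats, the dataset is
    reported as having no mode.
    """
    if not data:
        raise ValueError("data must not be empty")
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return "No mode", []
    # Every value that reaches the highest frequency counts as a mode.
    modes = sorted(value for value, count in counts.items() if count == highest)
    if len(modes) == 1:
        label = "Unimodal"
    elif len(modes) == 2:
        label = "Bimodal"
    else:
        label = "Multimodal"
    return label, modes


print(classify_modality([1, 3, 4, 5, 1, 1, 3, 4]))  # dataset from section 1
print(classify_modality([2, 3, 4, 5, 3, 2, 6, 7]))  # dataset from section 2
```

The first call reports a unimodal dataset with mode 1, and the second a bimodal dataset with modes 2 and 3.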
3. Multimodal Distribution
Dataset: [3, 5, 7, 3, 5, 9, 7, 7, 9]
✔ There are multiple dominant peaks (three or more modes)—indicating that the
dataset contains several frequently occurring values.
✔ The dataset likely has multiple subgroups—for example, in employee salary
data, entry-level, mid-level, and senior employees may each form distinct groups.
✔ It suggests complex variability—multimodal data often represents different
categories, behaviours, or clusters rather than a single unified trend.
✔ For categorical data, it indicates diverse preferences—for example, in a survey
about favourite food, multiple items may be equally popular.
✔ For numerical data, standard statistical summaries may be misleading—mean
and median might not accurately represent the dataset, and cluster analysis may be
needed.
✔ Avoid misleading conclusions: if the data has multiple modes, using only the mean
to describe the dataset can hide important differences. For example:
• Income distribution analysis: If data is bimodal, the mean salary may not
represent most employees accurately.
• Product sales trends: If customers prefer two distinct price points, a single
average price might not reflect real purchasing behaviour.
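A small Python sketch can make this concrete. The salary figures below are hypothetical values invented for illustration (two clusters, one near 30 and one near 90, in thousands); they do not come from the notes.

```python
from statistics import mean

# Hypothetical salaries (in thousands), invented for illustration:
# one cluster near 30 and another near 90.
salaries = [30, 31, 32, 33, 90, 91, 92, 93]

average = mean(salaries)
# Distance from the mean to the closest actual salary.
gap = min(abs(s - average) for s in salaries)

print(f"mean = {average}, nearest salary is {gap} away")
```

With these assumed values the mean is 61.5, yet no employee earns within 28.5 of it, so the average describes neither group.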
For each dataset below, identify whether it is Unimodal, Bimodal, or Multimodal and
specify the Mode(s):
1. [4, 6, 7, 8, 6, 4, 9, 10, 6]
2. [2, 2, 3, 3, 4, 4, 5, 6, 7, 8]
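Mean − Mode = 3 × (Mean − Median)
This empirical relationship holds approximately for moderately skewed, unimodal data.
It gives a rough estimate of the Mode, Mean, or Median when the other two are known.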
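Rearranging the relationship gives one expression per unknown. The Python sketch below shows the three rearrangements; the function names and the sample values (mean 50, median 48) are hypothetical, chosen only to illustrate the arithmetic, and the results are estimates rather than exact values.

```python
def estimate_mode(mean, median):
    # From Mean - Mode = 3 * (Mean - Median):  Mode = 3 * Median - 2 * Mean
    return 3 * median - 2 * mean


def estimate_mean(median, mode):
    # Solving the same relationship for the Mean.
    return (3 * median - mode) / 2


def estimate_median(mean, mode):
    # Solving the same relationship for the Median.
    return (2 * mean + mode) / 3


# Hypothetical inputs, not taken from any dataset in these notes.
print(estimate_mode(mean=50, median=48))
print(estimate_mean(median=48, mode=44))
print(estimate_median(mean=50, mode=44))
```

For the assumed inputs the estimated mode is 44, and feeding that value back in recovers the mean of 50 and the median of 48.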
When analysing datasets, we often compare the Mean, Median, and Mode to
determine whether the data is symmetrical or skewed. Skewness measures how
much a dataset is asymmetrical or leans towards one side.
1. Symmetric Data
• In a perfectly symmetric distribution, the Mean, Median, and Mode are equal
or nearly equal.
• Mean = 35
• Median = 35
• Mode = None
✔ The data is evenly distributed: values are spread equally on both sides of the
central point, meaning there is no significant skewness.
✔ Mean, Median, and Mode are approximately equal, indicating a balanced
dataset where the central value accurately represents the data.
✔ There are no extreme outliers affecting the distribution—making the mean a
reliable measure of central tendency.
✔ The dataset may follow a normal distribution—which is common in natural
phenomena like height, weight, IQ scores, and standardized test scores.
✔ Decision-making and predictions are more reliable—since the data follows a
predictable pattern without distortion from extreme values.
Takeaway: A symmetric dataset is well balanced and may be approximately normally
distributed, so using the mean for analysis is accurate and meaningful.
• A positively skewed dataset has a longer tail on the right (higher values are
stretched out).
• The Mean is greater than the Median, and the Mode is typically the smallest of the three.
• This happens when a few very large values (outliers) increase the mean.
Example: Income distribution in a country. Most people earn low to average wages,
but a few billionaires significantly increase the mean.
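The pull of a single large value on the mean can be sketched in Python. The income figures below are hypothetical (in thousands) and were invented for this illustration.

```python
from statistics import mean, median

# Hypothetical annual incomes in thousands, invented for illustration.
typical_incomes = [30, 32, 35, 38, 40, 42, 45]
with_outlier = typical_incomes + [1000]  # add one very high earner

for label, incomes in [("without outlier", typical_incomes),
                       ("with outlier", with_outlier)]:
    print(f"{label}: mean = {mean(incomes):.2f}, median = {median(incomes)}")
```

With these assumed values the mean jumps from about 37.43 to 157.75, while the median only moves from 38 to 39.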
Median = 20
Mean = 32.86
Key Point: In a right-skewed dataset, Mean > Median > Mode.
• A negatively skewed dataset has a longer tail on the left (lower values are
stretched out).
• The Mean is less than the Median, and the Mode is typically the largest of the three.
• This happens when a few very small values (outliers) decrease the mean.
Example: Exam scores where most students score high, but a few fail.
• Median = 25
• Mean = 22.14
o If the data is skewed, the Median is a better measure than the Mean
because it is less affected by outliers.
o Mode is useful for categorical data (e.g., most popular brand, favourite
colour).
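Comparing the Mean and the Median can be turned into a rough check for the direction of skew. The sketch below uses Pearson's second skewness coefficient, 3 × (Mean − Median) / standard deviation. The function name, the cut-off of 0.1, and the sample datasets are assumptions made for illustration.

```python
from statistics import mean, median, stdev


def skew_direction(data, threshold=0.1):
    """Classify skew with Pearson's second skewness coefficient.

    The threshold of 0.1 is an arbitrary cut-off chosen for this sketch.
    """
    spread = stdev(data)  # sample standard deviation; needs at least 2 values
    if spread == 0:
        return "roughly symmetric"  # all values are identical
    coefficient = 3 * (mean(data) - median(data)) / spread
    if coefficient > threshold:
        return "positively skewed"
    if coefficient < -threshold:
        return "negatively skewed"
    return "roughly symmetric"


# Hypothetical datasets, invented for illustration.
print(skew_direction([10, 20, 30, 40, 50]))          # mean equals median
print(skew_direction([10, 12, 14, 15, 18, 20, 90]))  # one very large value
print(skew_direction([20, 85, 88, 90, 92, 95]))      # one very small value
```

A coefficient near zero only shows that the Mean and Median agree, so the "roughly symmetric" label is a first indication rather than proof of symmetry.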
Final Summary
1. [10, 12, 15, 18, 20, 21, 23, 30, 35, 40, 50]
Given dataset: [23, 26, 28, 32, 33, 35, 38, 40, 41, 54]
3. Compute IQR
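One way to carry out this step is sketched below in Python, applied to the dataset given above. Quartile conventions differ between textbooks and software. This sketch assumes the median-of-halves convention (Q1 is the median of the lower half, Q3 the median of the upper half), and the function name is hypothetical.

```python
from statistics import median


def iqr_median_of_halves(data):
    """Return (Q1, Q3, IQR) using the median-of-halves convention.

    Assumption: for an odd number of values the middle value is left
    out of both halves. Needs at least 2 values.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]                  # lower half of the sorted data
    upper = ordered[len(ordered) - half:]   # upper half of the sorted data
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1


scores = [23, 26, 28, 32, 33, 35, 38, 40, 41, 54]
q1, q3, iqr = iqr_median_of_halves(scores)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```

Under this convention Q1 = 28, Q3 = 40, and IQR = 12. Interpolation-based conventions can give slightly different quartiles for the same data.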
Given dataset: [50, 53, 50, 51, 48, 93, 90, 92, 91, 90]
A school is analysing students' exam scores. Below are two datasets representing two
different classrooms:
Classroom A Scores:
[75, 78, 80, 82, 85, 87, 89, 90, 92, 95]
Classroom B Scores:
[50, 55, 60, 65, 70, 75, 80, 85, 90, 200]
Questions:
4. How does the extreme value (200) in Classroom B affect the Mean and
Median?
• Department A: [2500, 2700, 3000, 3200, 3400, 3500, 3600, 3800, 4000, 5000]
• Department B: [2500, 2700, 3000, 3200, 3400, 3500, 3600, 3800, 4000, 15000]
• Department C: [3000, 3200, 3500, 4000, 4200, 4500, 5000, 5200, 5500, 6000]
Questions:
2. Which department has a positively skewed salary distribution? How can you
tell?
5. Which measure would an employee use when negotiating for a raise? Why?
A real estate agency collected the following data on house prices (in $1000s) in two
neighbourhoods:
Neighbourhood X Prices:
[150, 175, 200, 220, 230, 250, 270, 280, 300, 500]
Neighbourhood Y Prices:
[180, 200, 210, 220, 225, 230, 240, 250, 260, 270]
Questions:
3. If a real estate agent wanted to make prices seem more affordable, which
measure should they use? Why?
4. If a buyer wanted to know the typical price, which measure should they rely
on?
A university professor collected final exam scores from two different subjects:
Mathematics Scores:
[50, 55, 60, 65, 70, 75, 80, 85, 90, 100]
Economics Scores:
[30, 40, 50, 55, 60, 65, 70, 80, 85, 95]
Questions:
5. How does the presence of low scores (30, 40) in Economics affect the Mean
compared to Mathematics?
Dataset:
[10, 12, 15, 20, 25, 30, 35, 50, 150]
Questions:
✔ Skewness helps detect biases in data—it reveals whether extreme values are
influencing the dataset, ensuring more accurate interpretations.
✔ The Mean is useful for symmetric data but can mislead in skewed distributions: it
is pulled towards outliers, which makes the Median the better choice for skewed data.
✔ The Mode is most useful for categorical data—it helps identify the most common
category in survey responses, purchasing behaviour, or preference analysis.
✔ Businesses often use the Median rather than the Mean for pricing strategies, which
avoids over-representing extreme high or low values in customer spending or housing markets.
By understanding skewness and the correct use of Mean, Median, and Mode,
data-driven decisions become more accurate, fair, and insightful!