Data Skewness and Exercises (1)

The document explains the concepts of unimodal, bimodal, and multimodal distributions, detailing their characteristics and implications for data analysis. It also covers the importance of understanding skewness in datasets, highlighting symmetric, positively skewed, and negatively skewed data, and their effects on central tendency measures. Additionally, it includes exercises for identifying modes, applying Pearson's formula, and analyzing data distributions.

Uploaded by

salmamaher2323

1. Understanding Mode and Skewed Data

1. Unimodal Distribution

A dataset is Unimodal if it has only one mode—one value that appears more
frequently than any other.

Characteristics of Unimodal Data:

• The data has one peak (one dominant frequency).

• The distribution is simple and easy to interpret.

Example 1 (Unimodal Data):

Dataset: [1, 3, 4, 5, 1, 1, 3, 4]

• Mode: 1 (most frequent value, appears 3 times)

What Does It Mean If My Dataset is Unimodal?

✔ There is a single dominant trend or peak: one value appears most frequently,
indicating a clear central tendency.
✔ The dataset is more predictable—since there is a clear most common value,
future data points may follow a similar pattern.
✔ It suggests homogeneity—values are clustered around one common mode,
meaning fewer distinct subgroups.
✔ For numerical data, it may indicate a normal distribution—if the mean,
median, and mode are close, the dataset might follow a bell curve.
✔ For categorical data, it highlights the most common category—helpful in
market research, surveys, and consumer behaviour analysis.

Takeaway: A unimodal dataset provides clear insights into the most frequent
trend and helps in making data-driven decisions.
2. Bimodal Distribution

A dataset is Bimodal if it has two distinct modes—two different values appear with
the highest frequency.

Characteristics of Bimodal Data:

• There are two peaks in the data.

• It may suggest two different groups within the dataset.

• Example: Test scores from a mixed-ability class—one group scoring high and
another scoring low.

Example 2 (Bimodal Data):

Dataset: [2, 3, 4, 5, 3, 2, 6, 7]

• Modes: 2 and 3 (both appear twice).

What Does It Mean If My Dataset is Bimodal?

✔ There are two dominant trends or peaks—indicating the presence of two most
frequently occurring values in the dataset.
✔ The dataset may have two distinct groups or subpopulations—for example, test
scores in a mixed-ability class where one group performs well and another
struggles.
✔ It suggests variability in the data—unlike unimodal distributions, bimodal data
may not follow a single normal distribution and might require separate analysis for
each mode.
✔ For categorical data, it indicates two popular choices—for example, in customer
preferences, two different product colours might be equally favoured.
✔ For numerical data, decision-making should consider both modes—since
treating the dataset as a single unit might lead to misleading conclusions.

Takeaway: A bimodal dataset indicates the existence of two dominant
patterns or groups, requiring careful analysis to understand the underlying
factors.
3. Multimodal Distribution

A dataset is Multimodal if it has three or more modes: three or more values appear
with the highest frequency.

Characteristics of Multimodal Data:

• The data has multiple peaks.

• This can indicate three or more subgroups in the dataset.

• Example: Age distribution in a workplace (young employees, mid-career
professionals, and senior employees).

Example 3 (Multimodal Data):

Dataset: [3, 5, 7, 3, 5, 9, 7]

• Modes: 3, 5, and 7 (each appears twice; 9 appears once).

What Does It Mean If My Dataset is Multimodal?

✔ There are multiple dominant peaks (three or more modes)—indicating that the
dataset contains several frequently occurring values.
✔ The dataset likely has multiple subgroups—for example, in employee salary
data, entry-level, mid-level, and senior employees may each form distinct groups.
✔ It suggests complex variability—multimodal data often represents different
categories, behaviours, or clusters rather than a single unified trend.
✔ For categorical data, it indicates diverse preferences—for example, in a survey
about favourite food, multiple items may be equally popular.
✔ For numerical data, standard statistical summaries may be misleading—mean
and median might not accurately represent the dataset, and cluster analysis may be
needed.

Takeaway: A multimodal dataset signals multiple underlying patterns or
groups, requiring segmented analysis for better insights.
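The three definitions above can be checked mechanically by counting how often each value occurs. The following is a minimal sketch, assuming a hypothetical helper named classify_modality (not part of the original material) and treating a dataset in which no value repeats as having no mode:

```python
from collections import Counter

def classify_modality(values):
    """Return (label, modes) based on how many values share the top frequency.

    Hypothetical helper for illustration; the "no mode" rule for datasets
    without repeated values is an assumption of this sketch.
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return "no mode", []
    modes = sorted(v for v, c in counts.items() if c == top)
    label = {1: "unimodal", 2: "bimodal"}.get(len(modes), "multimodal")
    return label, modes

print(classify_modality([1, 3, 4, 5, 1, 1, 3, 4]))  # Example 1: ('unimodal', [1])
print(classify_modality([2, 3, 4, 5, 3, 2, 6, 7]))  # Example 2: ('bimodal', [2, 3])
```

Counting exact repeats works for small discrete datasets like these. For continuous measurements, peaks are usually judged from a histogram instead, since exact values rarely repeat.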
Why Does This Matter?

Understanding Unimodal, Bimodal, and Multimodal data is crucial in data analysis
because it helps us:

✔ Detect patterns in datasets: Knowing how values are distributed allows us to
identify trends and make accurate predictions in various fields like economics,
healthcare, and social sciences.

✔ Identify different groups within data: Bimodal and multimodal distributions
indicate the presence of subgroups within the dataset. This is essential in:

• Education: Identifying high- and low-performing student groups.

• Business: Understanding different customer segments with varying buying
patterns.

• Medical Research: Detecting differences between patient groups with
different responses to treatment.

✔ Choose appropriate statistical models: Some machine learning and statistical
models assume that the data follows a single normal distribution (unimodal).
However, if the data is bimodal or multimodal, adjustments must be made, such
as:

• Using multiple distributions for better prediction models (e.g., mixture
models).

• Segmenting data into subgroups before analysis instead of treating it as a
single dataset.

✔ Avoid misleading conclusions—If data has multiple modes, using only the mean
to describe the dataset can hide important differences. For example:

• Income distribution analysis: If data is bimodal, the mean salary may not
represent most employees accurately.

• Product sales trends: If customers prefer two distinct price points, a single
average price might not reflect real purchasing behaviour.
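To make the point about misleading means concrete, here is a small sketch using hypothetical salary figures (illustrative values only, not taken from the exercises). The threshold used to split the data is also an assumption, chosen by eye between the two clusters:

```python
from statistics import mean

# Hypothetical monthly salaries forming two clusters (illustrative values only).
salaries = [2000, 2100, 2200, 2300, 7800, 7900, 8000, 8100]

overall_mean = mean(salaries)

# Assumed cut-off between the two groups; a real analysis would justify this choice.
threshold = 5000
low_group = [s for s in salaries if s < threshold]
high_group = [s for s in salaries if s >= threshold]

# Salaries that lie within 1000 of the overall mean.
near_mean = [s for s in salaries if abs(s - overall_mean) <= 1000]

print(overall_mean)                        # 5050
print(mean(low_group), mean(high_group))   # 2150 7950
print(near_mean)                           # []
```

No employee in this made-up dataset earns anything close to the overall mean, while the two group means describe their own clusters well. That is the reason for segmenting bimodal data before summarizing it.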

Takeaway: Recognizing Unimodal, Bimodal, and Multimodal distributions is
essential for accurate data interpretation, better decision-making, and choosing
the right statistical tools.
Exercise 1: Identifying Modes

For each dataset below, identify whether it is Unimodal, Bimodal, or Multimodal and
specify the Mode(s):

1. [4, 6, 7, 8, 6, 4, 9, 10, 6]

2. [2, 2, 3, 3, 4, 4, 5, 6, 7, 8]

3. [11, 15, 13, 11, 17, 15, 17, 18, 19]

2. Using Pearson’s Formula to Find Missing Values

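Pearson’s empirical formula for moderately skewed data:

Mean − Mode = 3 × (Mean − Median)

This relationship is an approximation rather than an exact identity. It lets us
estimate the Mode, Mean, or Median when the other two values are given.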
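Rearranging the formula gives one expression for each unknown. The sketch below shows the three rearrangements as small functions. The function names and the sample values (Mean = 30, Median = 28) are illustrative assumptions and are deliberately different from the exercise values:

```python
def mode_from(mean, median):
    """Mean - Mode = 3 * (Mean - Median)  =>  Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean

def mean_from(median, mode):
    """Same formula solved for the Mean: Mean = (3 * Median - Mode) / 2."""
    return (3 * median - mode) / 2

def median_from(mean, mode):
    """Same formula solved for the Median: Median = (2 * Mean + Mode) / 3."""
    return (2 * mean + mode) / 3

# Illustrative values only (not from the exercises).
print(mode_from(mean=30, median=28))    # 24
print(mean_from(median=28, mode=24))    # 30.0
print(median_from(mean=30, mode=24))    # 28.0
```

The three results are consistent with each other: substituting Mean = 30, Median = 28, and Mode = 24 into the original formula gives 6 on both sides.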

Exercise 2: Using Pearson’s Formula

Solve the following problems using Pearson’s formula:

1. Given: Mean = 50, Median = 45. Find the Mode.

2. Given: Median = 85, Mode = 75. Find the Mean.

3. Given: Mean = 90, Mode = 100. Find the Median.

3. Understanding Symmetric vs. Skewed Data

When analysing datasets, we often compare the Mean, Median, and Mode to
determine whether the data is symmetrical or skewed. Skewness measures how
much a dataset is asymmetrical or leans towards one side.

1. Symmetric Data

• In a perfectly symmetric distribution, the Mean, Median, and Mode are equal
or nearly equal.

• The data is evenly distributed on both sides of the centre.


Example Data (Symmetric Distribution):
[20, 25, 30, 35, 40, 45, 50]

• Mean = 35

• Median = 35

• Mode = None

What Does It Mean If My Dataset is Symmetric?

✔ The data is evenly distributed: values are spread equally on both sides of the
central point, meaning there is no significant skewness.
✔ Mean, Median, and Mode are approximately equal, indicating a balanced
dataset where the central value accurately represents the data.
✔ There are no extreme outliers affecting the distribution—making the mean a
reliable measure of central tendency.
✔ The dataset may follow a normal distribution—which is common in natural
phenomena like height, weight, IQ scores, and standardized test scores.
✔ Decision-making and predictions are more reliable—since the data follows a
predictable pattern without distortion from extreme values.
Takeaway: A symmetric dataset is well balanced and may be normally
distributed, so the mean is usually an accurate and meaningful summary.

2. Positively Skewed Data (Right-Skewed)

• A positively skewed dataset has a longer tail on the right (higher values are
stretched out).

• The Mean is greater than the Median, and the Mode is typically the smallest of
the three measures.

• This happens when a few very large values (outliers) increase the mean.

Example: Income distribution in a country. Most people earn low to average wages,
but a few billionaires significantly increase the mean.

Example Data (Positively Skewed):

[5, 10, 15, 20, 30, 50, 100]

• Mode = None (every value appears exactly once)

• Median = 20

• Mean = 32.86

Key Point: In a right-skewed dataset, typically Mean > Median > Mode (when a
mode exists).

3. Negatively Skewed Data (Left-Skewed)

• A negatively skewed dataset has a longer tail on the left (lower values are
stretched out).

• The Mean is less than the Median, and the Mode is typically the largest of the
three measures.

• This happens when a few very small values (outliers) decrease the mean.

Example: Exam scores where most students score high, but a few fail.

Example Data (Negatively Skewed):


[5, 15, 20, 25, 30, 30, 30]

• Mode = 30 (most frequent)

• Median = 25

• Mean = 22.14

Key Point: In a left-skewed dataset, typically Mean < Median < Mode.
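The comparisons in this section can be reproduced with Python's statistics module. The sketch below is a rough check only: the helper name describe is hypothetical, and labelling skew purely by comparing the mean with the median is a simplification that suits small teaching datasets like the three examples above.

```python
from statistics import mean, median, multimode

def describe(values):
    """Return (mean, median, modes, label) for a small dataset.

    Hypothetical helper; the label comes from a plain mean-vs-median
    comparison, which is an assumption of this sketch.
    """
    m, med = mean(values), median(values)
    modes = multimode(values)
    # multimode returns every value when nothing repeats; report that as no mode.
    if len(values) > 1 and len(modes) == len(values):
        modes = []
    if m > med:
        label = "positively skewed"
    elif m < med:
        label = "negatively skewed"
    else:
        label = "symmetric"
    return round(m, 2), med, modes, label

print(describe([20, 25, 30, 35, 40, 45, 50]))   # (35, 35, [], 'symmetric')
print(describe([5, 10, 15, 20, 30, 50, 100]))   # (32.86, 20, [], 'positively skewed')
print(describe([5, 15, 20, 25, 30, 30, 30]))    # (22.14, 25, [30], 'negatively skewed')
```

With real data the mean and median are rarely exactly equal, so a practical version would treat small differences as symmetric rather than testing for strict equality.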

Why is This Important?

1. Summarizing Data Correctly

o If the data is skewed, the Median is a better measure than the Mean
because it is less affected by outliers.

2. Interpreting Real-World Data

o Income, exam scores, and house prices are often skewed.

o Symmetric datasets are often seen in normally distributed data (e.g.,
IQ scores, height, weight).

3. Choosing the Right Analysis

o Mean works best for symmetric data.

o Median is better for skewed data.

o Mode is useful for categorical data (e.g., most popular brand, favorite
color).
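The claim that the Median is less affected by outliers than the Mean can be seen by adding a single extreme value to a small dataset. This sketch uses hypothetical scores (illustrative values, not from the exercises):

```python
from statistics import mean, median

# Hypothetical exam scores (illustrative values only).
scores = [60, 65, 70, 75, 80]
with_outlier = scores + [200]  # the same scores plus one extreme value

mean_before, median_before = mean(scores), median(scores)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(mean_before, median_before)           # 70 70
print(round(mean_after, 2), median_after)   # 91.67 72.5
```

One extreme value moves the mean by more than 21 points but moves the median by only 2.5, which is why the median is preferred for skewed data.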
Final Summary

✔ Symmetric Data: Mean ≈ Median ≈ Mode
✔ Positively Skewed Data: Mean > Median > Mode (Right-Skewed)
✔ Negatively Skewed Data: Mean < Median < Mode (Left-Skewed)

Exercise 3: Skewness Identification

For each dataset below, determine whether it is Symmetric, Positively Skewed, or
Negatively Skewed:

1. [10, 12, 15, 18, 20, 21, 23, 30, 35, 40, 50]

2. [5, 6, 7, 8, 9, 10, 11, 12, 13]

3. [2, 3, 4, 5, 6, 50, 51, 52]

4. Measuring Dispersion (Range, Quartiles, Boxplots)

Range = Maximum value - Minimum value

Quartiles divide data into four equal parts:

• Q1 (Lower Quartile) = 25% of the data below this value

• Q2 (Median) = 50% of the data below this value

• Q3 (Upper Quartile) = 75% of the data below this value

Interquartile Range (IQR) = Q3 - Q1 (measures spread of middle 50% of data).

Boxplots visualize quartiles and detect outliers.
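The measures above can be computed as follows. This is a sketch under stated assumptions: the dataset is hypothetical, the quartiles use the "inclusive" method of Python's statistics.quantiles (textbook hand methods may give slightly different quartiles), and the 1.5 × IQR fences are the conventional boxplot rule for flagging outliers, which the text above does not spell out.

```python
from statistics import quantiles

# Hypothetical dataset (illustrative values only).
data = sorted([2, 4, 6, 8, 10, 12, 14, 16, 40])

data_range = max(data) - min(data)

# Quartiles; other conventions may give slightly different values.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Conventional boxplot fences: points beyond them are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(data_range)                 # 38
print(q1, q2, q3, iqr)            # 6.0 10.0 14.0 8.0
print(lower_fence, upper_fence)   # -6.0 26.0
print(outliers)                   # [40]
```

The five-number summary for this dataset is then Min = 2, Q1 = 6, Median = 10, Q3 = 14, Max = 40, which is the information a boxplot draws.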


Exercise 4: Compute Range, Quartiles, and IQR

Given dataset: [23, 26, 28, 32, 33, 35, 38, 40, 41, 54]

1. Find the Range

2. Identify Q1, Q2 (Median), and Q3

3. Compute IQR

Exercise 5: Five-Number Summary and Boxplot

Given dataset: [50, 53, 50, 51, 48, 93, 90, 92, 91, 90]

1. Compute the five-number summary (Min, Q1, Median, Q3, Max).

2. Describe where outliers might be.

Exercise 6: Choosing the Right Measure of Central Tendency

A school is analysing students' exam scores. Below are two datasets representing two
different classrooms:

Classroom A Scores:
[75, 78, 80, 82, 85, 87, 89, 90, 92, 95]

Classroom B Scores:
[50, 55, 60, 65, 70, 75, 80, 85, 90, 200]

Questions:

1. Calculate the Mean, Median, and Mode for both classrooms.

2. Which classroom has a more symmetric distribution?

3. Which measure (Mean or Median) is a better representation of student
performance in each classroom? Why?

4. How does the extreme value (200) in Classroom B affect the Mean and
Median?

5. If you were a teacher reporting student performance to parents, which
measure would you use for each classroom? Explain.
Exercise 7: Income Distribution Analysis

A company is analysing the monthly salaries of employees across three departments:

• Department A: [2500, 2700, 3000, 3200, 3400, 3500, 3600, 3800, 4000, 5000]

• Department B: [2500, 2700, 3000, 3200, 3400, 3500, 3600, 3800, 4000, 15000]

• Department C: [3000, 3200, 3500, 4000, 4200, 4500, 5000, 5200, 5500, 6000]

Questions:

1. Find the Mean, Median, and Mode for each department.

2. Which department has a positively skewed salary distribution? How can you
tell?

3. Which department has the most symmetric salary distribution?

4. If the company wanted to advertise "average salary," which measure should
they use to be accurate and fair?

5. Which measure would an employee use when negotiating for a raise? Why?

Exercise 8: Detecting Skewness in Real Estate Prices

A real estate agency collected the following data on house prices (in $1000s) in two
neighbourhoods:

Neighbourhood X Prices:
[150, 175, 200, 220, 230, 250, 270, 280, 300, 500]

Neighbourhood Y Prices:
[180, 200, 210, 220, 225, 230, 240, 250, 260, 270]

Questions:

1. Calculate the Mean, Median, and Mode for both neighbourhoods.

2. Which neighbourhood has a right-skewed (positively skewed) distribution?
Explain.

3. If a real estate agent wanted to make prices seem more affordable, which
measure should they use? Why?

4. If a buyer wanted to know the typical price, which measure should they rely
on?

5. What does the presence of an outlier (500) in Neighbourhood X tell us about
the pricing trends?
Exercise 9: Comparing Test Scores Across Subjects

A university professor collected final exam scores from two different subjects:

Mathematics Scores:
[50, 55, 60, 65, 70, 75, 80, 85, 90, 100]

Economics Scores:
[30, 40, 50, 55, 60, 65, 70, 80, 85, 95]

Questions:

1. Find the Mean, Median, and Mode for both subjects.

2. Which subject has a more symmetrical score distribution? How do you
know?

3. Which measure (Mean or Median) best represents student performance for
each subject? Why?

4. If the university administration wanted to compare subject difficulty based
on average scores, which measure should they use?

5. How does the presence of low scores (30, 40) in Economics affect the Mean
compared to Mathematics?

Exercise 10: Impact of Skewness in Business Decision-Making

A restaurant owner is analysing monthly customer spending data:

Dataset:
[10, 12, 15, 20, 25, 30, 35, 50, 150]

Questions:

1. Find the Mean, Median, and Mode.

2. Is this dataset positively skewed, negatively skewed, or symmetric? Explain.

3. Why might the Mean be misleading for determining "average customer
spending"?

4. If the restaurant wants to set a reasonable price range for promotions,
should they use the Mean or Median? Why?

5. If the owner wants to attract high-spending customers, which measure
should they advertise?
Key Takeaways on Skewness, Mean, Median, and Mode

✔ Skewness helps detect biases in data—it reveals whether extreme values are
influencing the dataset, ensuring more accurate interpretations.

✔ The Mean is useful for symmetric data but misleading in skewed distributions—it
gets affected by outliers, making the Median a better alternative in some cases.

✔ The Median is resistant to extreme values: it provides a better measure of central
tendency in skewed distributions, such as income, house prices, and test scores.

✔ The Mode is most useful for categorical data—it helps identify the most common
category in survey responses, purchasing behaviour, or preference analysis.

✔ Businesses often use the Median rather than the Mean for pricing strategies, avoiding
over-representation of extreme high or low values in customer spending or housing markets.

✔ Understanding skewness helps avoid misleading conclusions: for example,
reporting the mean salary of a company might make incomes seem higher due to
executive salaries.

✔ Financial analysts use skewness to detect risk and anomalies: a right-skewed
stock return distribution suggests high reward potential but also high risk.

✔ Healthcare professionals rely on median values: e.g., median survival rates in
medical studies offer a more reliable indicator than the mean when dealing with
extreme values.

✔ Educational institutions analyze exam performance using skewness to
determine if test difficulty was balanced or if grading adjustments are necessary.

✔ Marketers use mode to identify consumer preferences: for example, the most
frequently purchased product size, color, or brand helps optimize inventory and
promotions.

By understanding skewness and the correct use of Mean, Median, and Mode,
data-driven decisions become more accurate, fair, and insightful!
