Boxplot (4) (1)
22-CSE-09, 22-CSE-12,
22-CSE-15, 22-CSE-18
Index
1. Introduction of Boxplot
2. Parts of Boxplot
3. Create and Interpret Boxplot
4. Applications of Boxplot
5. Summary
Introduction:
What is a Boxplot?
A boxplot (or box-and-whisker plot) is a
standardized way to display data distribution
based on a five-number summary:
• Minimum
• First quartile (Q1)
• Median (Q2)
• Third quartile (Q3)
• Maximum
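As a minimal sketch, the five-number summary can be computed with Python's standard library. The function name `five_number_summary` and the sample values are assumptions made for illustration; `statistics.quantiles` with `method="inclusive"` is one of several quartile conventions, and others give slightly different Q1 and Q3 on small samples.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # quantiles(n=4) returns the three cut points Q1, Q2 (median), Q3.
    # method="inclusive" is an assumption of this sketch; other
    # conventions can shift Q1 and Q3 slightly.
    q1, med, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]

# Illustrative values only (not taken from the slides).
sample = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
summary = five_number_summary(sample)
print(summary)  # (7, 36.75, 40.5, 42.75, 49)
```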
Why Use Boxplots?
Uses:
• Exploratory Data Analysis (EDA).
• Preprocessing data for machine learning.
Comparison with other plots:
Example: Analyzing customer
ages for a retail store.
3. Whiskers
Lines extending from the box to the smallest and largest data points
within a certain range.
Formula Tip:
• Whiskers go up to the values that are not considered outliers.
• Lower Whisker: smallest data point ≥ (Q1 − 1.5 × IQR)
• Upper Whisker: largest data point ≤ (Q3 + 1.5 × IQR)
IQR (Interquartile Range) = Q3 − Q1
Why Whiskers? They help us see the spread of most of the data.
4. Outliers
Data points far outside the whiskers (unusually high/low values).
In a boxplot, they’re usually small dots or stars.
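The whisker and outlier rules above can be sketched as follows. The function name `whiskers_and_outliers`, the parameter `k`, and the data are assumptions for illustration; Q1 and Q3 are passed in by hand so the example stays focused on the 1.5 × IQR rule.

```python
def whiskers_and_outliers(values, q1, q3, k=1.5):
    """Apply the k x IQR rule for the given quartiles.

    Returns (lower_whisker, upper_whisker, outliers).
    """
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    # Whiskers end at actual data points, not at the fences themselves.
    return min(inside), max(inside), outliers

# Illustrative values; Q1 = 12 and Q3 = 16 are what the "inclusive"
# quartile convention gives for this list.
data = [2, 10, 12, 13, 14, 15, 16, 18, 40]
low, high, outliers = whiskers_and_outliers(data, q1=12, q3=16)
print(low, high, outliers)  # 10 18 [2, 40]
```

Here IQR = 4, so the fences sit at 6 and 22, but the whiskers stop at 10 and 18 because those are the most extreme data points inside the fences.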
Creating and Interpreting a Boxplot
What we'll cover
• Content:
o Steps to create a boxplot
o How to interpret a boxplot:
Symmetrical vs. skewed data
Identifying outliers
Comparing multiple boxplots
o Example with numbers and a boxplot
Creating a Boxplot
1. Collect Data: Gather a numerical dataset (e.g., test
scores, sales figures).
2. Order Data: Arrange data in ascending order.
3. Find Key Values:
• Median (Q2): Middle value of the ordered dataset (the average of the two middle values when the count is even).
• First Quartile (Q1): Median of the lower half.
• Third Quartile (Q3): Median of the upper half.
• Interquartile Range (IQR): Q3 - Q1.
• Whiskers: Extend to the smallest/largest values within 1.5 * IQR
from Q1/Q3.
• Outliers: Values beyond the whiskers.
4. Draw the Boxplot:
Code to create boxplot:
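The original code for this slide did not survive, so the following is only a sketch of steps 1 to 4 under stated assumptions. It uses the Math scores from the Real World Example in this deck, computes Q1 and Q3 as medians of the lower and upper halves (as in step 3), and draws the box from those values with matplotlib's `Axes.bxp`. The function names `key_values` and `draw_boxplot` and the output filename are hypothetical, and matplotlib is treated as optional: without it, only the numbers are produced.

```python
from pathlib import Path
from statistics import median

try:  # matplotlib is only needed for step 4 (drawing)
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display required
    import matplotlib.pyplot as plt
except ImportError:
    plt = None

def key_values(values):
    """Steps 2-3: order the data and find the key values."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])               # median of the lower half
    q3 = median(data[len(data) - half:])   # median of the upper half
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "med": median(data), "q1": q1, "q3": q3, "iqr": iqr,
        "whislo": min(inside), "whishi": max(inside),
        "fliers": [v for v in data if v < low_fence or v > high_fence],
    }

def draw_boxplot(stats, path):
    """Step 4: draw the box from precomputed values (needs matplotlib)."""
    if plt is None:
        return None
    fig, ax = plt.subplots()
    ax.bxp([stats])            # one dict of key values per box
    ax.set_ylabel("Score")
    fig.savefig(path)
    plt.close(fig)
    return Path(path)

# Step 1: the Math scores from the Real World Example slide.
scores = [45, 50, 52, 55, 58, 60, 60, 61, 62, 63, 65, 65, 66, 67, 68,
          70, 71, 72, 74, 75, 76, 78, 80, 82, 85, 88, 90, 92, 94, 95]
stats = key_values(scores)
print(stats["q1"], stats["med"], stats["q3"], stats["iqr"])  # 61 69.0 80 19
out_file = draw_boxplot(stats, "scores_boxplot.png")
```

Passing raw data to `ax.boxplot(scores)` is the more common shortcut; it applies matplotlib's own quartile convention, so the box edges may differ slightly from the hand-computed ones.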
Interpreting a Boxplot - Symmetry vs. Skewness
• Symmetrical Data: Median is centered in the box,
whiskers are equal length.
o Indicates a balanced distribution (e.g., normal distribution).
• Skewed Data:
o Right Skew (Positive): Longer upper whisker, median closer to
Q1.
o Left Skew (Negative): Longer lower whisker, median closer to
Q3.
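• Why It Matters: Skewness affects data analysis and
model assumptions; for example, methods that assume roughly
normal data may need a transformation or a robust alternative.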
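The two visual cues above (whisker lengths and the position of the median inside the box) can be turned into a rough check. This is a heuristic sketch, not a statistical test of skewness; the function names `sign` and `describe_shape` and the sample values are assumptions for illustration.

```python
def sign(x):
    """Return +1, 0, or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

def describe_shape(whislo, q1, med, q3, whishi):
    """Heuristic skew reading from the five plotted values."""
    # +1 if the upper whisker is longer, -1 if the lower one is.
    votes = sign((whishi - q3) - (q1 - whislo))
    # +1 if the median sits closer to Q1, -1 if closer to Q3.
    votes += sign((q3 - med) - (med - q1))
    if votes > 0:
        return "right-skewed (positive)"
    if votes < 0:
        return "left-skewed (negative)"
    # Both cues tied, or the two cues disagree.
    return "no clear skew"

print(describe_shape(1, 2, 3, 6, 12))    # right-skewed (positive)
print(describe_shape(1, 7, 10, 11, 12))  # left-skewed (negative)
print(describe_shape(1, 2, 3, 4, 5))     # no clear skew
```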
Interpreting a Boxplot - Outliers
• What Are Outliers?: Data points below Q1 - 1.5 * IQR or
above Q3 + 1.5 * IQR.
• How to Spot Them: Marked as dots or stars outside the
whiskers in a boxplot.
• Why They Matter:
• May indicate errors, anomalies, or significant variations.
• Critical in data mining for fraud detection or quality control.
Interpreting a Boxplot - Comparing Boxplots
• Purpose: Compare distributions across groups (e.g., sales
by region).
• How to Read:
• Compare medians: Higher/lower central tendency.
• Compare IQRs: Spread of the middle 50% of data.
• Compare whiskers/outliers: Range and anomalies.
• Example: Boxplots of test scores for different classes.
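A numeric version of this comparison might look like the sketch below, which tabulates the median, IQR, and range per group. The class names and scores are made up for illustration, and `summarize` is a hypothetical helper; quartiles use the "inclusive" convention of `statistics.quantiles`.

```python
from statistics import median, quantiles

def summarize(values):
    """Median, IQR, and range: the quantities compared across boxplots."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "median": median(values),
        "iqr": q3 - q1,
        "range": max(values) - min(values),
    }

# Hypothetical test scores for two classes (invented for this sketch).
classes = {
    "Class A": [55, 60, 62, 65, 68, 70, 72, 75, 80],
    "Class B": [40, 52, 60, 66, 70, 74, 82, 90, 98],
}
summaries = {name: summarize(scores) for name, scores in classes.items()}
for name, s in summaries.items():
    print(name, s)
```

In this made-up data the medians are close (68 vs. 70), but Class B has more than twice the IQR of Class A, which side-by-side boxplots would show as a much taller box.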
Applications of Box Plot in DM
• Outlier Detection: Box plots are excellent for identifying outliers in a
dataset, which can be crucial in fraud detection or data cleaning processes.
• Data Distribution Visualization: Box plots provide a five-number summary
(minimum, Q1, median, Q3, maximum), allowing data scientists to understand
the distribution and spread of data quickly.
• Comparative Analysis: When comparing multiple datasets (e.g., sales across
regions or performance across models), box plots make it easy to compare
medians, spreads, and detect outliers across categories.
Importance of Box Plot in DM
• Box plots condense large datasets into a simple graphical
representation, aiding quick understanding and decision-making.
• They help in selecting or transforming variables based on their
distribution, variability, and presence of outliers.
• They provide a clear way to present statistical summaries to
stakeholders, especially in dashboards or reports.
• Unlike mean and standard deviation, box plots use median and
interquartile range, making them more robust to noise and outliers.
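The robustness point can be illustrated with a small sketch: replacing one value with an extreme one moves the mean and standard deviation a great deal, while the median and IQR stay put. The data and the helper name `iqr` are assumptions for illustration.

```python
from statistics import mean, median, stdev, quantiles

def iqr(values):
    """Interquartile range using the 'inclusive' quartile convention."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Illustrative values; `noisy` is `clean` with its largest value
# replaced by an extreme one.
clean = [48, 50, 51, 52, 53, 54, 55, 57, 60]
noisy = clean[:-1] + [600]

for name, data in (("clean", clean), ("noisy", noisy)):
    print(f"{name}: mean={mean(data):.1f} stdev={stdev(data):.1f} "
          f"median={median(data)} iqr={iqr(data)}")
```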
Practical Scenario: Fraud Detection
Scenario: Credit Card Fraud Detection
A bank analyzes daily transaction amounts of customers to detect potential
fraud activities.
Use of Box Plot:
• A box plot is created for each customer showing their daily
transaction amounts over a month.
• Most customers show consistent spending patterns (tight
interquartile ranges).
• For one customer, the box plot shows:
• A low median transaction (e.g., $50)
• Multiple outliers above $2000
Insight:
• These outliers are far from the typical spending behavior and
may indicate unauthorized transactions.
Action Taken:
• The system flags the account for review. The fraud team
investigates and confirms fraudulent activity. The customer is
notified, and the card is deactivated.
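A per-customer version of the flagging step could be sketched as below. The customer IDs, amounts, and the function name `flag_accounts` are all invented for illustration; a real system would add further checks before flagging an account.

```python
from statistics import quantiles

def flag_accounts(transactions, k=1.5):
    """Return {customer: amounts above that customer's upper fence}."""
    flagged = {}
    for customer, amounts in transactions.items():
        q1, _, q3 = quantiles(amounts, n=4, method="inclusive")
        upper_fence = q3 + k * (q3 - q1)
        high = [a for a in amounts if a > upper_fence]
        if high:  # only keep customers with at least one high outlier
            flagged[customer] = high
    return flagged

# Hypothetical daily transaction amounts in dollars.
transactions = {
    "cust_001": [40, 45, 48, 50, 52, 55, 58, 60, 62],
    "cust_002": [30, 42, 45, 48, 50, 52, 55, 2100, 2500],
}
flagged = flag_accounts(transactions)
print(flagged)  # {'cust_002': [2100, 2500]}
```

Each customer is judged against their own quartiles, which matches the scenario: a low median (here $50) with a few amounts far above the upper whisker.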
Real World Example
Student Math Test Scores
Suppose a teacher wants to visualize the distribution of Math scores out
of 100 for a class of 30 students. The scores are:
45, 50, 52, 55, 58, 60, 60, 61, 62, 63, 65, 65, 66, 67, 68, 70, 71, 72, 74,
75, 76, 78, 80, 82, 85, 88, 90, 92, 94, 95
Box plot would show:
• Minimum score: 45
• First quartile (Q1): 61
• Median (Q2): 69
• Third quartile (Q3): 80
• Maximum score: 95
(Q1 and Q3 are the medians of the lower and upper halves of the ordered scores.)
How it's useful
• Teachers can quickly spot outliers (e.g., very low or very high
scores).
• They can compare this box plot to other classes to assess
performance variability.
• It helps identify whether the class distribution is skewed or
symmetric.