
Dgdsfa1e Presentation 2 3

Chapter 2 focuses on data wrangling, specifically the cleaning of anomalous values. It discusses various types of anomalous data, including implausible values, extreme values, incorrect formats, and duplicate records, highlighting the importance of thorough data cleaning processes. A case study illustrates the consequences of failing to identify anomalies in data.

Chapter 2

Data Wrangling:
Preprocessing

Section 2.3
Cleaning Anomalous Values

Copyright © 2025, Pearson Education, Inc. Slide 1


Overview

01 Anomalous Data Types

02 Implausible Values and Extreme Values

03 Incorrect Values and Duplicate Records



Overview

01 Anomalous Data Types



Types of Anomalous Values
Implausible values

Extreme values

Incorrect data formats

Duplicate records



Overview
02 Implausible Values and Extreme Values



Implausible Values
Credit Score Range    Credit Score Meaning
720 - 850             Excellent
690 - 719             Good
630 - 689             Fair
300 - 629             Bad

name      credit_score
Juan      800
Alice     2,995
Kai       690
Kahlil    -53
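
A range check is one way to flag implausible values like these. The sketch below is a minimal illustration, assuming the valid range is the 300 to 850 span covered by the credit score table; the function name `is_plausible` and the record layout are hypothetical, not part of the slides.

```python
# Valid range assumed from the credit score table (300 to 850).
MIN_SCORE, MAX_SCORE = 300, 850

# Records copied from the example table on the slide.
records = [
    {"name": "Juan", "credit_score": 800},
    {"name": "Alice", "credit_score": 2995},
    {"name": "Kai", "credit_score": 690},
    {"name": "Kahlil", "credit_score": -53},
]


def is_plausible(score, low=MIN_SCORE, high=MAX_SCORE):
    """Return True when the score lies inside the documented range."""
    return low <= score <= high


# Collect the rows whose score cannot be a real credit score.
implausible = [r for r in records if not is_plausible(r["credit_score"])]
implausible_names = [r["name"] for r in implausible]

print(implausible_names)  # → ['Alice', 'Kahlil']
```

Flagged rows are candidates for verification or correction, not automatic deletion.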





Extreme Data Values and Outliers

Extreme data values: Deviate significantly from the average or expected range
Response: Verify data accuracy, transform them, or filter based on domain knowledge

Outliers: May indicate errors or anomalies
Response: Use statistical techniques or remove outlier values from analysis
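
The slide does not name a specific statistical technique. As one common choice, the sketch below applies the 1.5 × IQR rule to flag candidate outliers. The data values, the function name `iqr_outliers`, and the multiplier `k=1.5` are all assumptions made for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one extreme value.
data = [52, 48, 50, 47, 55, 51, 49, 250]

outliers = iqr_outliers(data)
print(outliers)  # → [250]
```

A flagged value is only a candidate: whether to verify, transform, or filter it remains a domain-knowledge decision.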



Extreme Value Likely to be an Outlier



Case Study: Dewey Defeats Truman
Case Study Overview:
● In the 1948 presidential election, a newspaper famously declared the wrong winner
● Identifying anomalous data could have caught the error so that it was corrected before publication
● This case underscores the importance of thorough data cleaning processes in analytical work



Overview
03 Incorrect Values and Duplicate Records



Incorrect Data Formats
Incorrect data formats:
• Numerical values stored as text
  “seventeen” instead of 17
• Inconsistent naming conventions
  “Goldfisg” instead of “Goldfish”

Deletion is a last resort
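
Because deletion is a last resort, both format problems above can often be repaired in place. The sketch below is one possible approach, not a method given in the slides: a small lookup table converts number words, and the standard library's `difflib.get_close_matches` maps a misspelled name to the closest known one. The names `NUMBER_WORDS`, `KNOWN_SPECIES`, `to_number`, and `fix_name`, along with the 0.8 similarity cutoff, are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical lookup table; a real one would cover more words.
NUMBER_WORDS = {"fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18}

# Hypothetical list of accepted spellings for the name column.
KNOWN_SPECIES = ["Goldfish", "Guppy", "Betta"]


def to_number(value):
    """Convert digits or a known number word to int; None if unrecognized."""
    text = str(value).strip().lower()
    if text in NUMBER_WORDS:
        return NUMBER_WORDS[text]
    try:
        return int(text)
    except ValueError:
        return None  # flag for manual review rather than deleting the row


def fix_name(value, vocabulary=KNOWN_SPECIES, cutoff=0.8):
    """Replace a near-miss spelling with the closest accepted name."""
    matches = get_close_matches(value, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else value


print(to_number("seventeen"))  # → 17
print(fix_name("Goldfisg"))    # → Goldfish
```

Values that cannot be repaired come back as `None` or unchanged, so they can be reviewed before anyone considers dropping them.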



Identifying and Removing Duplicate Records

1. Identify potential duplicate records
2. Compare identification fields
3. Keep only one copy
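
The three steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: the records, the field names `customer_id` and `email`, and the choice to keep the first copy seen are all hypothetical, and the slides do not specify which fields identify a record.

```python
# Hypothetical records; the third row repeats the first with messy formatting.
records = [
    {"customer_id": 101, "name": "Juan", "email": "juan@example.com"},
    {"customer_id": 102, "name": "Alice", "email": "alice@example.com"},
    {"customer_id": 101, "name": "juan ", "email": "JUAN@example.com"},
]


def identity_key(record):
    """Build a comparison key from the identification fields (assumed here)."""
    return (record["customer_id"], record["email"].strip().lower())


seen = set()
unique, duplicates = [], []
for record in records:
    key = identity_key(record)   # step 2: compare identification fields
    if key in seen:
        duplicates.append(record)  # step 1: a potential duplicate is identified
    else:
        seen.add(key)
        unique.append(record)      # step 3: keep only the first copy

print(len(unique), len(duplicates))  # → 2 1
```

Normalizing the key (trimming whitespace, lowercasing) before comparison lets formatting variants of the same record match.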



Summary

01 Anomalous Data Types

02 Implausible Values and Extreme Values

03 Incorrect Values and Duplicate Records

