TP2- ML -handling outliers

Uploaded by

Anouar Belabbes

TP2 – Machine Learning

USE CASE I:
Part I: Understanding and handling missing values in the dataset
The dataset is available in the following repository:

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
1- Write Python code to load this dataset.
2- Add this line of code after loading the dataset:
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion',
                'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']

3- Display the columns of this dataset.

4- Display the statistical summary of this dataset.
5- Display its shape.
6- Display the number of missing values in this dataset.
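
Steps 1 to 6 could be sketched as follows. This is one possible answer, not the worksheet's own solution. So that it runs offline, a small illustrative sample (`SAMPLE`, an assumption) in the same header-less, comma-separated layout stands in for the file; passing the dataset URL to `pd.read_csv` instead should load the real data.

```python
import io
import pandas as pd

# Illustrative stand-in for breast-cancer-wisconsin.data (assumption):
# same layout, 11 comma-separated fields per row, no header line.
SAMPLE = """\
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1017122,8,10,10,8,7,10,9,7,1,4
1096800,6,6,6,9,6,?,7,8,1,2
"""

# Step 1: load the data (replace io.StringIO(SAMPLE) with the URL for the real file).
data = pd.read_csv(io.StringIO(SAMPLE), header=None)

# Step 2: name the columns as given in the worksheet.
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion',
                'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']

print(data.columns.tolist())   # step 3: column names
print(data.describe())         # step 4: summary statistics of the numeric columns
print(data.shape)              # step 5: (rows, columns)

# Step 6: '?' is still an ordinary string at this point, so it is not
# counted as missing yet.
missing = data.isna().sum()
print(missing)
```

Note that `describe()` only summarises numeric columns by default, so a column that still contains `'?'` strings is left out of the summary.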
7- According to the description of the dataset, missing values are encoded as '?'
in the original data. Our first task is to convert these values to NaN. We can
then count the number of missing values in each column of the data. Type the code
below to achieve this:
Which column contains many missing values?
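
A minimal sketch of the conversion described in step 7, again on an illustrative inline sample (`SAMPLE` and `COLUMNS` are names introduced here, not part of the worksheet). It replaces `'?'` with `NaN`, converts the affected column to numbers, and counts the missing values per column.

```python
import io
import numpy as np
import pandas as pd

COLUMNS = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size',
           'Uniformity of Cell Shape', 'Marginal Adhesion',
           'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
           'Normal Nucleoli', 'Mitoses', 'Class']

# Illustrative rows in the layout of the real file (assumption).
SAMPLE = """\
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1017122,8,10,10,8,7,10,9,7,1,4
1096800,6,6,6,9,6,?,7,8,1,2
"""

data = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS)

# Turn the '?' placeholders into real missing values.
data = data.replace('?', np.nan)

# The column was read as text because of the '?'; make it numeric again.
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'])

# Count the missing values in each column.
missing_per_column = data.isna().sum()
print(missing_per_column)
```

An alternative is to pass `na_values='?'` to `pd.read_csv`, which performs the same conversion while loading.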

8- Fill the missing values of this column with its median. Display the values before
and after this operation.

9- What is the other common method of handling missing values? Apply it to this column
too.

10- Assume the column "Sample code" is not significant for the processing. Drop it
and display the shape of the dataset afterwards.
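
Steps 8 to 10 might look like the sketch below. It assumes that the column with missing values is `Bare Nuclei` and that the "other common method" in step 9 means discarding the rows that contain missing values; both are assumptions, as is the inline `SAMPLE` used in place of the real file.

```python
import io
import pandas as pd

COLUMNS = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size',
           'Uniformity of Cell Shape', 'Marginal Adhesion',
           'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
           'Normal Nucleoli', 'Mitoses', 'Class']

# Illustrative rows in the layout of the real file (assumption).
SAMPLE = """\
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1057013,8,4,5,1,2,?,7,3,1,4
1017122,8,10,10,8,7,10,9,7,1,4
1096800,6,6,6,9,6,?,7,8,1,2
"""

# na_values='?' converts the placeholders to NaN while loading.
data = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS,
                   na_values='?')
col = 'Bare Nuclei'

# Step 8: fill the missing values with the column median.
before = data[col]
after = before.fillna(before.median())
print(before)
print(after)

# Step 9: the alternative, drop the rows where this column is missing.
dropped = data.dropna(subset=[col])
print(dropped.shape)

# Step 10: remove the identifier column and show the new shape.
reduced = data.drop(columns=['Sample code'])
print(reduced.shape)
```

Filling keeps every row but distorts the distribution slightly, while dropping keeps only observed values at the cost of losing rows.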

Part II: Dealing with outliers in the dataset

1- Using the boxplot: outliers can be detected visually in a boxplot:

2- What do you notice in the boxplots of the columns?
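
As an illustration of the boxplot step, the sketch below draws one box per column with `DataFrame.boxplot` and also counts the points outside the usual whisker limits (1.5 times the interquartile range beyond the quartiles). The two-column `features` frame is made-up data, and the `Agg` backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for two of the attributes (assumption).
features = pd.DataFrame({
    'Clump Thickness': [5, 5, 3, 6, 4, 8, 1, 2, 2, 4],
    'Mitoses':         [1, 1, 1, 1, 1, 1, 1, 1, 1, 7],
})

# One box per column; points drawn beyond the whiskers are outlier candidates.
ax = features.boxplot(figsize=(8, 3))
ax.set_title('Boxplot of each attribute')
plt.close('all')  # in an interactive session, call plt.show() instead

# The same rule the whiskers use, expressed numerically.
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1
outside = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)
outlier_counts = outside.sum()
print(outlier_counts)
```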

3- There are many techniques for handling outliers. Let's explore the Z-score: we can
compute the Z-score of each attribute and remove the instances containing attributes
with an abnormally high or low Z-score (e.g., if Z > 3 or Z <= -3). Execute the code below to
compute the Z-score:

4- Now, let's discard the outliers outside that range, using a threshold of 3:
Part III: Dealing with duplicate data

Some datasets, especially those obtained by merging multiple data sources, may contain
duplicate or near-duplicate instances. The term deduplication is often used to refer to the
process of dealing with duplicate data issues.

1- Let's check for duplicates in our dataset. Run this code:

2- How many duplicates does this dataset contain?

3- Let's discard them:
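
A possible way to carry out the three deduplication steps with pandas is shown below, on a small made-up frame. `duplicated()` marks every row that repeats an earlier one, and `drop_duplicates()` keeps only the first occurrence.

```python
import pandas as pd

# Made-up rows (assumption): rows 2 and 4 repeat rows 0 and 1.
data = pd.DataFrame(
    [[5, 1, 1, 2],
     [3, 1, 2, 2],
     [5, 1, 1, 2],
     [8, 10, 10, 4],
     [3, 1, 2, 2]],
    columns=['Clump Thickness', 'Uniformity of Cell Size',
             'Bare Nuclei', 'Class'],
)

# Step 1: flag the rows that duplicate an earlier row.
dups = data.duplicated()

# Step 2: count them.
n_duplicates = int(dups.sum())
print('Number of duplicate rows:', n_duplicates)

# Step 3: discard them, keeping the first occurrence of each row.
deduped = data.drop_duplicates()
print('Rows before:', len(data), 'after:', len(deduped))
```

Identical rows are not always errors: once an identifier column has been dropped, two different samples can legitimately share the same attribute values.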

USE CASE II:
Dealing with outliers in the diabetes dataset:
1- Load the diabetes dataset and display its first lines:
2- Check whether the bmi column contains outliers, using a boxplot.
3- Remove the outliers using this function:

Explain the code above.

4- Display the boxplot. What has changed?
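
The worksheet's removal function is not reproduced here, so the sketch below shows one common choice instead: an interquartile-range (IQR) filter on a single column. The function name `remove_iqr_outliers`, the factor `k`, and the sample values are all assumptions. The worksheet's dataset is presumably scikit-learn's diabetes data, whose features include `bmi` and `bp`, but a small made-up frame is used so the example runs without it.

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Return the rows of df whose `column` value lies within
    [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper, not the worksheet's)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

# Made-up values on a scale similar to mean-centred features (assumption).
df = pd.DataFrame({
    'bmi': [-0.05, -0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.05, 0.17],
    'bp':  [-0.04, -0.02, -0.03, 0.0, 0.01, -0.01, 0.02, 0.04, 0.03, 0.02],
})

print(df.head())                      # first lines of the data
cleaned = remove_iqr_outliers(df, 'bmi')
print(len(df), '->', len(cleaned))    # rows before and after filtering
```

Redrawing the boxplot on `cleaned['bmi']` would show whether any points still fall beyond the whiskers; since the quartiles are recomputed on the filtered data, new outliers can appear.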

5- Another way to spot outliers is the scatter plot. Build one using the two
related variables bmi and bp:
Display the scatter plot. What do you notice?

6- Remove the outliers using this code:

Do you notice any change?
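
Steps 5 and 6 could be sketched as follows. The scatter plot shows `bmi` against `bp`, and points lying far from the main cloud are then removed with a simple two-variable condition. The cutoffs `BMI_CUTOFF` and `BP_CUTOFF` are hypothetical values read off this made-up sample, not the worksheet's thresholds.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values (assumption); the last point sits apart from the others.
df = pd.DataFrame({
    'bmi': [-0.05, -0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03, 0.05, 0.17],
    'bp':  [-0.04, -0.02, -0.03, 0.0, 0.01, -0.01, 0.02, 0.04, 0.03, 0.02],
})

# Step 5: scatter plot of the two related variables.
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(df['bmi'], df['bp'])
ax.set_xlabel('bmi')
ax.set_ylabel('bp')
plt.close(fig)  # in an interactive session, call plt.show() instead

# Step 6: drop the points in the isolated region (hypothetical cutoffs).
BMI_CUTOFF = 0.12
BP_CUTOFF = 0.08
is_outlier = (df['bmi'] > BMI_CUTOFF) & (df['bp'] < BP_CUTOFF)
cleaned = df[~is_outlier]
print(len(df), '->', len(cleaned))
```

Thresholds chosen by eye depend on the data at hand, so they should be re-read from the plot whenever the dataset changes.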
