TP2- ML -handling outliers

Uploaded by

Anouar Belabbes
Copyright
© All Rights Reserved

TP2 – Machine Learning

USE CASE I:
Part I: Understanding and handling missing values in the dataset
Consider the dataset available in this repository:

https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
1- Write Python code to load this dataset.
2- Add this line of code after loading the dataset:
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion',
                'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']
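
A minimal sketch of steps 1 and 2, assuming pandas is used. So that it runs offline, it reads three made-up rows laid out like the data file (comma-separated, no header line) from a string; to load the real data, pass the URL given above to pd.read_csv in place of the StringIO object.

```python
import io
import pandas as pd

# Made-up rows in the same layout as breast-cancer-wisconsin.data
# (comma-separated, no header line). They are NOT copied from the dataset.
sample = io.StringIO(
    "1000001,5,1,1,1,2,1,3,1,1,2\n"
    "1000002,5,4,4,5,7,?,3,2,1,2\n"
    "1000003,8,10,10,8,7,10,9,7,1,4\n"
)

# header=None: the file has no header row, so pandas must not
# use the first record as column names.
data = pd.read_csv(sample, header=None)

data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion',
                'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']

print(data.head())
```

Without header=None, the first record would be consumed as the header and one instance would be lost.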

3- Display the different columns in this dataset.

4- Display the statistical summary of this dataset.
5- Display its shape.
6- Display the number of missing values in this dataset.
7- According to the description of the dataset, missing values are encoded as '?'
in the original data. Our first task is to convert them to NaN. We can then count
the number of missing values in each column of the data. Type the code below to
achieve this:
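
The original code for this step is not reproduced in this copy. A possible sketch, assuming pandas and NumPy, is shown here on a small made-up frame that stands in for the loaded dataset:

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the loaded dataset.
data = pd.DataFrame({
    'Clump Thickness': [5, 5, 8, 3],
    'Bare Nuclei': ['1', '?', '10', '?'],
    'Class': [2, 2, 4, 2],
})

# Replace the '?' placeholder with NaN so that pandas treats it as missing.
data = data.replace('?', np.nan)

print('Number of instances = %d' % data.shape[0])
print('Number of attributes = %d' % data.shape[1])
print('Number of missing values:')
for col in data.columns:
    print('\t%s: %d' % (col, data[col].isna().sum()))
```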
Which column contains a large number of missing values?

8- Fill in the missing values of this column with its median. Display the values before
and after this operation.
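
One way to approach this step, sketched on a made-up column (the values are illustrative). Because the column contained '?', its values are text, so the sketch converts them to numbers before taking the median:

```python
import numpy as np
import pandas as pd

# Made-up column standing in for the one with missing values.
data = pd.DataFrame({'Bare Nuclei': ['1', '?', '10', '3', '?']})
data = data.replace('?', np.nan)

# The values were read as text because of '?'; convert them to numbers.
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'])

print('Before replacing missing values:')
print(data['Bare Nuclei'])

# median() ignores NaN by default, so it is computed on the known values only.
median = data['Bare Nuclei'].median()
data['Bare Nuclei'] = data['Bare Nuclei'].fillna(median)

print('After replacing missing values:')
print(data['Bare Nuclei'])
```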

9- What is the other common method of handling missing values? Apply it to this column
too.

10- Assume the column 'Sample code' is not significant for the processing. Drop it
and display the shape of the dataset afterwards.

Part II: Dealing with outliers in the dataset


1- Using the boxplot: outliers can be detected visually in a boxplot:
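
The plotting code is not reproduced in this copy. With pandas, data.boxplot() draws one box per numeric column (it requires matplotlib). The sketch below does not plot; it computes, on made-up values, the points that a standard boxplot marks individually, namely those beyond 1.5 times the interquartile range from the quartiles. The helper name boxplot_outliers is hypothetical.

```python
import pandas as pd

# Made-up values standing in for two columns of the dataset.
data = pd.DataFrame({
    'Clump Thickness': [5, 5, 3, 6, 4, 8, 1, 2, 2, 4],
    'Mitoses':         [1, 1, 1, 1, 1, 1, 1, 1, 5, 1],
})

def boxplot_outliers(series):
    """Return the values lying beyond the usual 1.5 * IQR whisker limits."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

for col in data.columns:
    print(col, boxplot_outliers(data[col]).tolist())
```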

2- What do you notice in the boxplots of the columns?


3- There are many techniques for handling outliers. Let's explore the Z-score, which
measures how many standard deviations a value lies from the mean of its column. We can
compute the Z-score for each attribute and remove the instances containing attributes
with an abnormally high or low Z-score (e.g., if Z > 3 or Z <= -3). Execute the code
below to compute the Z-score:
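
The original code is not reproduced in this copy. A minimal sketch, assuming a pandas DataFrame that holds only the numeric attributes (on the real dataset, the Class column would be set aside first). The values are made up.

```python
import pandas as pd

# Made-up numeric attributes standing in for the dataset.
data = pd.DataFrame({
    'Clump Thickness': [5, 5, 3, 6, 4, 8, 1, 2],
    'Mitoses':         [1, 1, 1, 1, 1, 1, 5, 1],
})

# Column-wise standardisation: subtract each column's mean and divide by
# its standard deviation (pandas uses the sample standard deviation, ddof=1).
Z = (data - data.mean()) / data.std()

print(Z.head())
```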

4- Now, let's discard the outliers outside that range, using a threshold of 3:
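
A possible sketch of this filtering step, on made-up values: a row is kept only if every attribute satisfies -3 < Z <= 3, which matches the rule stated in step 3.

```python
import pandas as pd

# Made-up attributes: 20 rows, with one extreme 'Mitoses' value in the last row.
data = pd.DataFrame({
    'Clump Thickness': list(range(1, 11)) * 2,
    'Mitoses': [1] * 19 + [10],
})

Z = (data - data.mean()) / data.std()

# Keep a row only if all of its Z-scores lie in the interval (-3, 3].
keep = ((Z > -3) & (Z <= 3)).all(axis=1)
data_clean = data[keep]

print('Number of rows before discarding outliers = %d' % data.shape[0])
print('Number of rows after discarding outliers = %d' % data_clean.shape[0])
```

Note that with very few rows no Z-score can exceed 3 (the largest possible value for n rows is (n - 1) / sqrt(n)), which is why the sketch uses 20 rows.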


Part III: Dealing with duplicate data

Some datasets, especially those obtained by merging multiple data sources, may contain
duplicate or near-duplicate instances. The term deduplication is often used to refer to
the process of dealing with duplicate data issues.

1- Let's check for duplicates in our dataset. Run this code:
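
The original code is not reproduced in this copy. A minimal sketch using the pandas duplicated() method, on a made-up frame:

```python
import pandas as pd

# Made-up frame in which rows 1 and 3 repeat row 0.
data = pd.DataFrame({
    'Clump Thickness': [5, 5, 3, 5],
    'Mitoses':         [1, 1, 1, 1],
    'Class':           [2, 2, 2, 2],
})

# duplicated() marks a row True when an identical row appears earlier.
dups = data.duplicated()
print('Number of duplicate rows = %d' % dups.sum())
```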

2- How many duplicates does this dataset contain?


3- Let's discard them:
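
A minimal sketch using the pandas drop_duplicates() method, which keeps the first occurrence of each repeated row. The frame is made up.

```python
import pandas as pd

# Made-up frame in which rows 1 and 3 repeat row 0.
data = pd.DataFrame({
    'Clump Thickness': [5, 5, 3, 5],
    'Mitoses':         [1, 1, 1, 1],
    'Class':           [2, 2, 2, 2],
})

print('Number of rows before discarding duplicates = %d' % data.shape[0])
data2 = data.drop_duplicates()
print('Number of rows after discarding duplicates = %d' % data2.shape[0])
```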

USE CASE II:
Dealing with outliers in the diabetes dataset:
1- Load the diabetes dataset and display its first lines:
2- Check whether the bmi column contains outliers, using a boxplot.
3- Remove the outliers using this function:
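
The function from the original sheet is not reproduced in this copy. As an illustration only, here is one common choice, the interquartile-range (IQR) rule, which uses the same limits as the whiskers of a standard boxplot. The function name remove_outliers_iqr, its parameter k and the sample values are all assumptions, not the sheet's own code.

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep only the rows whose `column` value lies within k * IQR of the quartiles."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Made-up values standing in for the diabetes data.
diabetes = pd.DataFrame({
    'bmi': [21.0, 23.5, 24.1, 25.0, 26.3, 27.8, 29.0, 55.0],
    'bp':  [70, 75, 80, 82, 85, 88, 90, 95],
})

cleaned = remove_outliers_iqr(diabetes, 'bmi')
print(len(diabetes), 'rows before,', len(cleaned), 'rows after')
```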

Explain the code above.


4- Display the boxplot. What has changed?

5- Another way to detect outliers is the scatter plot. Build one using the two
related variables bmi and bp:
Display the scatter plot. What do you notice?

6- Remove the outliers using this code:
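
The original code is not reproduced in this copy. One simple possibility is to filter out the points that sit far from the main cloud, using cut-offs read by eye from the scatter plot. The cut-off values and the data below are hypothetical.

```python
import pandas as pd

# Made-up values standing in for the diabetes data; the last row has an extreme bp.
diabetes = pd.DataFrame({
    'bmi': [21.0, 23.5, 24.1, 25.0, 26.3, 27.8, 29.0, 24.5],
    'bp':  [70, 75, 80, 82, 85, 88, 90, 160],
})

# Hypothetical cut-offs, as if read by eye from the scatter plot.
BMI_MAX = 40.0
BP_MAX = 120

# Keep only the points inside both cut-offs.
mask = (diabetes['bmi'] <= BMI_MAX) & (diabetes['bp'] <= BP_MAX)
cleaned = diabetes[mask]

print(len(diabetes), 'rows before,', len(cleaned), 'rows after')
```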

Display the scatter plot again. Do you notice any change?
