TP2 - ML - Handling outliers
USE CASE I :
Part I : Understanding and handling missing values in the dataset
The dataset is available in the following repository:
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
1 - Write Python code to load this dataset.
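A minimal sketch of the loading step, assuming pandas is used and that the file is comma-separated with no header row. The helper name load_dataset and the two sample rows are illustrative only; they are there so the sketch runs without network access.

```python
import io

import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/breast-cancer-wisconsin.data")


def load_dataset(source):
    """Hypothetical helper: read the file, which has no header row."""
    return pd.read_csv(source, header=None)


# Illustrative rows in the same 11-column layout, so the sketch runs offline.
sample = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"
)
data = load_dataset(sample)  # for the real data: load_dataset(URL)
print(data.shape)
print(data.head())
```

Passing header=None keeps the first record as data instead of treating it as column names, which matters because the column names are assigned afterwards.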
2 - Add this line of code after loading the dataset:
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size',
                'Uniformity of Cell Shape', 'Marginal Adhesion',
                'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']
8 - Fill in the missing values of this column with its median. Display the values before
and after this operation.
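One possible way to do the median fill is sketched here. The column name 'Bare Nuclei' and the '?' marker for missing entries are assumptions (the worksheet only says "this column"), and the six values are invented for illustration.

```python
import pandas as pd

# Illustrative stand-in for the column; name and '?' marker are assumptions.
data = pd.DataFrame({"Bare Nuclei": ["1", "10", "?", "2", "?", "4"]})

# Convert to numbers; anything that is not a number (here '?') becomes NaN.
col = pd.to_numeric(data["Bare Nuclei"], errors="coerce")
print("Before:")
print(col)

median = col.median()  # NaN entries are skipped when computing the median
data["Bare Nuclei"] = col.fillna(median)
print("After:")
print(data["Bare Nuclei"])
```

The median is computed from the observed values only, so the filled entries do not shift it.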
9 - What is the other common method of handling missing values? Apply it to this column
too.
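A common alternative to imputation is to discard the rows that contain missing values. The sketch below assumes that this is the method meant here, and again uses an invented frame with 'Bare Nuclei' as the assumed column.

```python
import numpy as np
import pandas as pd

# Invented rows; 'Bare Nuclei' as the affected column is an assumption.
data = pd.DataFrame({
    "Clump Thickness": [5, 8, 1, 4],
    "Bare Nuclei": [1.0, np.nan, 2.0, np.nan],
})
print("Rows before:", data.shape[0])

# Keep only the rows where 'Bare Nuclei' is present.
cleaned = data.dropna(subset=["Bare Nuclei"])
print("Rows after:", cleaned.shape[0])
```

Dropping rows is simple but shrinks the dataset, so it is usually reasonable only when few rows are affected.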
10 - Assume the column « Sample code » is not significant for the analysis. Drop it, then
display the shape of the dataset.
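Dropping the identifier column could look like the following sketch. The small frame is illustrative; only the column name 'Sample code' comes from the worksheet.

```python
import pandas as pd

# Illustrative frame with a subset of the worksheet's columns.
data = pd.DataFrame({
    "Sample code": [1000025, 1002945],
    "Clump Thickness": [5, 5],
    "Class": [2, 2],
})

# drop() returns a new frame, so the result is assigned back to data.
data = data.drop(columns=["Sample code"])
print(data.shape)
```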
Some datasets, especially those obtained by merging multiple data sources, may contain
duplicates or near duplicate instances. The term deduplication is often used to refer to the
process of dealing with duplicate data issues.
1 - Let's check for duplicates in our dataset. Run this code:
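The worksheet's own snippet is not reproduced here, so the following is only a sketch of how duplicates are typically checked with pandas, using an invented frame.

```python
import pandas as pd

# Invented rows: rows 1 and 3 repeat row 0 exactly.
data = pd.DataFrame({
    "Clump Thickness": [5, 5, 3, 5],
    "Mitoses": [1, 1, 1, 1],
    "Class": [2, 2, 2, 2],
})

# duplicated() flags every row that repeats an earlier row.
dups = data.duplicated()
print("Number of duplicate rows:", dups.sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduplicated = data.drop_duplicates()
print("Shape after deduplication:", deduplicated.shape)
```

Note that this only catches exact duplicates; near-duplicate instances need a similarity measure and are not covered by this sketch.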
USE CASE II :
Dealing with outliers in the diabetes dataset :
1 - Load and display the first lines of the diabetes dataset :
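A possible way to load the data, assuming "the diabetes dataset" refers to the copy bundled with scikit-learn (it has the bmi and bp columns used in the next steps). If the worksheet intends a different file, only the loading line changes.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects; .frame holds the features plus the target.
diabetes = load_diabetes(as_frame=True)
df = diabetes.frame

print(df.shape)
print(df.head())
```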
2 - Check whether the bmi column contains outliers, using a boxplot.
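A boxplot sketch with matplotlib follows. The eight bmi values are invented so that the result is easy to check by hand; with the real data, pass the bmi column of the loaded frame instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented values on a small centered scale; 0.17 is deliberately far from the rest.
df = pd.DataFrame({"bmi": [-0.05, -0.03, -0.01, 0.0, 0.01, 0.02, 0.04, 0.17]})

fig, ax = plt.subplots()
result = ax.boxplot(df["bmi"].to_numpy())
ax.set_ylabel("bmi")
ax.set_title("Boxplot of bmi")
fig.savefig("bmi_boxplot.png")

# Points drawn beyond the whiskers (1.5 * IQR by default) are the "fliers".
fliers = result["fliers"][0].get_ydata()
print("Points beyond the whiskers:", list(fliers))
```

Any point plotted individually above or below the whiskers is a candidate outlier.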
3 - Remove the outliers using this function:
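The function referred to is not reproduced here. As a stand-in, this is a sketch of a commonly used approach, the interquartile range (IQR) rule; the name remove_outliers_iqr, its parameter k, and the sample values are all hypothetical.

```python
import pandas as pd


def remove_outliers_iqr(df, column, k=1.5):
    """Hypothetical helper: keep rows within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive on both ends by default.
    return df[df[column].between(lower, upper)]


# Invented values; 0.17 lies above the upper fence.
df = pd.DataFrame({"bmi": [-0.05, -0.03, -0.01, 0.0, 0.01, 0.02, 0.04, 0.17]})
cleaned = remove_outliers_iqr(df, "bmi")
print(df.shape, "->", cleaned.shape)
```

With k=1.5 the fences match the default whiskers of a boxplot, so the rows removed are the points the boxplot draws individually.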
5 - Another way to detect outliers is a scatter plot. Build one using the two
related variables bmi and bp:
Display the scatter plot. What do you notice?
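A sketch of the scatter plot, again assuming the scikit-learn copy of the diabetes dataset and matplotlib for plotting.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

# Assumption: the scikit-learn copy of the dataset, which has bmi and bp columns.
df = load_diabetes(as_frame=True).frame

fig, ax = plt.subplots()
points = ax.scatter(df["bmi"], df["bp"], alpha=0.6)
ax.set_xlabel("bmi")
ax.set_ylabel("bp")
ax.set_title("bmi vs bp")
fig.savefig("bmi_bp_scatter.png")
```

Points that sit far from the main cloud are the ones to examine as possible outliers.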