
Chapter 1: Data Preparation

Machine Learning Team


UP GL-BD
Learning outcomes
At the end of this module, the student will be able to:

• Use the main Python libraries for data science, namely NumPy, SciPy, pandas, and scikit-learn (sklearn).

• Define the data preparation process:


• Loading data as DataFrames

• Data visualization
• Data cleaning
• Data mining
• Data transformation
• Data storage…

• Generate reduced-dimensional data using the “Principal Component Analysis” method

Machine Learning Basics


2
ESPRIT 2024/2025
Introduction
• Predictive modeling projects involve learning from data.
• The data consists of examples or cases from the domain that characterize the problem to be solved.
• In a machine learning project, raw data typically cannot be used directly.
• This is for reasons such as:
• Machine learning algorithms require data to be numbers.

• Some machine learning algorithms impose requirements on the data.

• Statistical noise and errors in the data may need to be corrected…

What is Machine Learning?

• Machine learning is a subset of artificial intelligence.

• Machines can learn without being explicitly programmed, provided we supply a mathematical model, a database, and a set of features.

• A computer program learns from examples with respect to a given mathematical model.

What is Machine Learning?
• Supervised machine learning:
- Learns from a labeled dataset.
- Uses a mathematical model.
• Unsupervised machine learning:
- Blind analysis.
- Used when the dataset is not labeled.
- Uses mathematical models.
- Uses similarity measures.
• Reinforcement machine learning:
- Used when there is no training dataset.

Machine Learning: How?
Supervised learning workflow (figure): training data + training labels → model; test data → model → prediction; test labels → evaluation.


Machine learning algorithms
• Supervised learning models
• Regression: used to predict a continuous value.
• Classification: used to predict a discrete value (a discrete class label output for an example).
• Unsupervised learning models
• Clustering: grouping a set of objects into the same group based on similarity.
• Dimension reduction
• Association rules

Data Science
Life cycle

Data Science Driven Projects
Methodologies

CRISP-DM methodology

IBM Master Plan methodology

Microsoft TDSP methodology

Interest
• The raw data must be pre-processed prior to being used to fit and evaluate a machine learning model.
• This step in a predictive modeling project is referred to as “data preparation”.

Data Preparation

Standard tasks of data preparation
• Data Cleaning: identifying and correcting mistakes or errors in the data.
• Data Transforms: changing the scale or distribution of variables.
• Feature Engineering: deriving new variables from available data.
• Dimensionality Reduction:
o Feature Selection: identifying those input variables that are most relevant to the task.
o Principal Component Analysis (PCA): creating compact projections of the data.
Data Cleaning

“Data cleaning is used to refer to all kinds of tasks and activities to detect and repair errors in the data.”

Page xiii, Data Cleaning, 2019.

“Cleaning up your data is not the most glamourous of tasks, but it’s an essential part of data wrangling.
[…] Knowing how to properly clean and assemble your data will set you miles apart from others in your
field.”

Page 149, Data Wrangling with Python, 2016.

Data Cleaning

• Data cleaning includes simple tasks such as:


• Defining normal data.
• Removing duplicate rows and redundant or irrelevant columns.
• Identifying outliers.
• Dealing with missing values.

Data Cleaning: Redundant samples

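Redundant samples (duplicate rows) can be dropped with pandas. A minimal sketch, using a small hypothetical frame for illustration:

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical data)
df = pd.DataFrame({"name": ["Ana", "Bob", "Bob"],
                   "age": [23, 31, 31]})

# drop_duplicates() keeps the first occurrence of each repeated row
df_unique = df.drop_duplicates().reset_index(drop=True)
print(df_unique.shape)  # (2, 2)
```

By default all columns are compared; pass `subset=[...]` to detect duplicates on a subset of columns only.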
Data Cleaning: Redundant features

# Quick examples:

# Drop duplicate columns by comparing column values
# (note: transposing can change dtypes on mixed-type frames)
df2 = df.T.drop_duplicates().T

# Remove columns with duplicated names from a pandas DataFrame
df2 = df.loc[:, ~df.columns.duplicated()]

# Use DataFrame.columns.duplicated() to drop duplicate columns in place
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)

Data Cleaning: Outliers

“We will generally define outliers as samples that are exceptionally far from the mainstream of the
data.”

Page 33, Applied Predictive Modeling, 2013.

Outliers causes:
• Measurement or input error.
• Data corruption.
• True outlier observation…

Data Cleaning: Outliers
• Some methods to detect outliers:
o Boxplot
o Histogram
o Scatter plot

Data Cleaning: Outliers
• Some methods to handle outliers:

• Drop the outlier records.

• Cap your outliers’ data (min, max).

• Assign a new value…
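The detection and capping steps above can be sketched with the common interquartile-range (IQR) rule; the 1.5 × IQR fence and the toy values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical numeric column with one extreme value
s = pd.Series([10, 12, 11, 13, 12, 95])

# IQR fences: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < low) | (s > high)]    # detection
capped = s.clip(lower=low, upper=high)  # capping at (min, max)
print(list(outliers))  # [95]
```

Dropping the flagged rows or assigning a new value (e.g., the median) are the other two options listed above.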

Data cleaning: Missing value
What is missing data?

• Missing data are values that are not available but that would be meaningful if observed.
• Missing data can take many forms: missing sequences, incomplete features, missing files, incomplete information, data entry errors, etc.

• Filling in missing values with data is called data imputation.

Data cleaning: Missing value

Some data imputation methods:

• Delete individuals with missing data

• Replace missing data with a fixed value

• Replace missing data with a decision tree

• Replace missing data with nearest values

• Replace missing data with dedicated algorithms…

Data cleaning: Missing value
A popular approach for data imputation is to calculate a statistical value for each column (such as a mean) and
replace all missing values for that column with the statistic.

Strategies (attention: missing values must be marked with NaN):
• "mean": replace missing values using the mean along each column (numeric data).
• "median": replace missing values using the median along each column (numeric data).
• "most_frequent": replace missing values using the most frequent value along each column (numeric data or strings).
• "constant": replace missing values with fill_value (numeric data or strings); fill_value=0 when imputing numeric data, fill_value="missing_value" for strings or object data types.

Data cleaning: Missing value
Python

"mean" strategy:
Method 1: sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean')
Method 2: df.fillna(df.mean())

"median" strategy:
Method 1: sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='median')
Method 2: df.fillna(df.median())

"most_frequent" strategy:
Method 1: sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='most_frequent')
Method 2: df['columnName'] = df['columnName'].fillna(df['columnName'].mode()[0])

"constant" strategy:
sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=...)

(The verbose parameter is deprecated in recent scikit-learn versions and can be omitted.)
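A minimal end-to-end use of SimpleImputer with the "mean" strategy, on a hypothetical single-column frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame; the missing value is marked with NaN
df = pd.DataFrame({"age": [25.0, np.nan, 30.0, 35.0]})

# Fit computes the column mean (30.0); transform fills the NaN with it
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[["age"]] = imputer.fit_transform(df[["age"]])
print(df["age"].tolist())  # [25.0, 30.0, 30.0, 35.0]
```

The equivalent one-liner is `df.fillna(df.mean())`; SimpleImputer has the advantage of remembering the fitted statistic so it can be reapplied to test data.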

Data Transformation

• Data transformation is used to remove noise from a dataset.

• Noise is any distorted and meaningless data in a data set.

Data Transformation: Categorical features (1/3)

• Numerical examples: price, age; number of persons, number of days.
• Categorical examples: small, medium, large; satisfied, dissatisfied, extremely dissatisfied; woman, man; red, black, blue…

Data Transformation: Categorical features (2/3)

• Transforming categorical data is an essential step during data preprocessing. sklearn's machine learning library requires the input dataset to contain only numeric values; it does not support categorical data directly.

• It is necessary to convert categorical features to a numerical representation.

• Before you start transforming your data, it is important to figure out if the feature you’re working
on is ordinal (as opposed to nominal). An ordinal feature is best described as a feature with
ordered categories.

Data Transformation: Categorical features (3/3)
• Once you know what type of categorical data you’re working on, you can pick a suitable transformation tool. In sklearn that will be:
o An OrdinalEncoder or LabelEncoder for ordinal data,

o A OneHotEncoder for nominal data.

Data transformation: Feature scaling
• Feature scaling is a technique to standardize the independent features present in the data to a fixed range.

• It is performed during data pre-processing to handle highly varying magnitudes, values, or units.

• If feature scaling is not done, a machine learning algorithm tends to weigh larger values higher and treat smaller values as lower, regardless of the units of the values.
Data transformation: Feature scaling
Standardization
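Standardization (z-score scaling) rescales each feature to z = (x − mean) / std, giving mean 0 and standard deviation 1. A minimal sketch with a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature matrix (hypothetical values)
X = np.array([[1.0], [2.0], [3.0], [4.0]])

# z = (x - mean) / std for each column
X_std = StandardScaler().fit_transform(X)
print(X_std.ravel())  # mean 0, standard deviation 1
```

StandardScaler stores the fitted mean and scale, so the same transformation can later be applied to test data with `transform()`.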

Data Transformation: Feature scaling
Normalization
• Some normalization methods are :
o Maximum Absolute Scaling

o Min-max normalization

o Decimal scaling
o Z-score normalization…
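Two of the normalization methods above can be sketched with sklearn scalers, on a hypothetical single-feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

X = np.array([[10.0], [20.0], [30.0]])

# Min-max normalization: (x - min) / (max - min), mapped into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)
print(X_minmax.ravel().tolist())  # [0.0, 0.5, 1.0]

# Maximum absolute scaling: x / max(|x|), mapped into [-1, 1]
X_maxabs = MaxAbsScaler().fit_transform(X)
print(X_maxabs.ravel())
```

Decimal scaling and z-score normalization follow the same fit/transform pattern (z-score via StandardScaler, as shown for standardization).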

Feature engineering (1/2)
• Feature engineering is the process of transforming raw data into features that better
represent the underlying problem to the predictive models, resulting in improved model
accuracy on unseen data

• Example of Feature Engineering techniques:


o Creation of features: sum, difference, average, min, max, product, or quotient of a group of features.
o Extracting features from a text (e.g., topic extraction: extract the main topics from a text…).
o Extracting features from an image.
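The "creation of features" technique above can be sketched with pandas aggregations; the transactions table and column names are hypothetical:

```python
import pandas as pd

# Hypothetical transactions: derive aggregate features per customer
df = pd.DataFrame({"customer": ["a", "a", "b"],
                   "amount": [10.0, 30.0, 5.0]})

# New features: sum, average, min, max of each customer's amounts
feats = df.groupby("customer")["amount"].agg(["sum", "mean", "min", "max"])
print(feats.loc["a"].tolist())  # [40.0, 20.0, 10.0, 30.0]
```

These derived columns can then be joined back to the original table and used as model inputs.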

Feature engineering (2/2)

Dimensionality Reduction
• Dimensionality reduction, or dimension reduction, is the transformation of data from a
high-dimensional space into a low-dimensional space so that the low-dimensional
representation retains some meaningful properties of the original data.

Feature Selection
• Feature selection is the process of selecting the most important features to input into machine learning algorithms.

• One key benefit: it reduces overfitting.

Feature Selection : Methods

Feature Selection : Methods
• “Filter” vs. “Wrapper”:

• Wrapper: iteratively choose the subset of features that yields the best-performing model.

• Filter: assign a score to each input feature and select the best-scoring features.
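A filter method can be sketched with sklearn's SelectKBest, which scores each feature (here with the ANOVA F-test) and keeps the k best; the Iris dataset and k = 2 are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Filter method: score every feature against the target, keep the top k
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X, y)
print(X_best.shape)  # (150, 2)
```

Wrapper methods (e.g., sklearn's RFE) instead refit a model repeatedly on candidate subsets, which is more expensive but accounts for feature interactions.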

Dimensionality reduction advantages

• Fewer dimensions in the data mean less training time, fewer computational resources, and often better overall performance of machine learning algorithms.

• Dimensionality reduction helps avoid the problem of overfitting.

• Dimensionality reduction is extremely useful for data visualization.

• Dimensionality reduction removes noise in the data.

• Dimensionality reduction can be used for image compression.

• Dimensionality reduction can be used to transform non-linear data into a linearly separable form…
Dimensionality Reduction: PCA

• Principal component analysis (PCA) is a statistical procedure used to reduce dimensionality.

• PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

• PCA is often used as a dimensionality reduction technique.
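In practice, PCA is usually applied through sklearn after standardizing the data; the Iris dataset and the choice of two components are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)  # 150 samples, 4 features

# Standardize first, then project onto the 2 leading principal components
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print(X_pca.shape)  # (150, 2)
```

`pca.components_` holds the eigenvectors (one per row) and `pca.explained_variance_` the corresponding eigenvalues, matching the manual derivation in the next slides.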

How does PCA work?
PCA relies on a process called eigenvalue decomposition of the covariance matrix of a data set.

The steps are as follows:

• First, calculate the covariance matrix of a data set.

• Then, calculate the eigenvectors of the covariance matrix.


o The eigenvector with the highest eigenvalue represents the direction of highest variance; this identifies the first principal component.
o The eigenvector with the next highest eigenvalue represents the direction of the highest remaining variance, orthogonal to the first direction; this identifies the second principal component.

• Like this, identify the top ‘k’ eigenvectors having top ‘k’ eigenvalues to get the ‘k’ best principal components.
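The steps above can be sketched directly with NumPy; the random 10×2 dataset and k = 1 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # hypothetical 10-sample, 2-feature dataset

# Step 1: center the data; Step 2: covariance matrix of the dataset
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# Step 3: eigenvalues/eigenvectors, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the top k = 1 eigenvector and project the data onto it
Z = Xc @ eigvecs[:, :1]
print(Z.shape)  # (10, 1)
```

`np.linalg.eigh` is the right choice here because a covariance matrix is symmetric; it returns real eigenvalues in ascending order, hence the reordering step.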

PCA steps

• Step 1: Standardize the dataset

• Step 2: Find the eigenvalues and eigenvectors

• Step 3: Arrange the eigenvalues

• Step 4: Form the feature vector

• Step 5: Transform the original dataset

Example: PCA (1/11)
• Consider the following dataset:

• Step 1: Standardize the Dataset

Example: PCA (2/11)
• Step 2: Find the eigenvalues and eigenvectors

Compute the covariance matrix C = (Xᵀ X) / N,
where X is the (standardized) dataset matrix, Xᵀ is the transpose of X, and N is the number of elements = 10.

Example: PCA (3/11)
• The characteristic equation:
det(C − λ × I) = 0, where λ is an eigenvalue and I is the identity matrix,
is used to identify the eigenvalues λᵢ.

Solving this equation (taking the determinant of the left side), we get two values for λ: λ₁ = 1.28403 and λ₂ = 0.0490834.

Example: PCA (4/11)
• Now we have to find the eigenvectors for the eigenvalues λ₁ and λ₂.
• To find the eigenvectors from the eigenvalues, we use the following approach:
o First, we find the eigenvector for the eigenvalue 1.28403 using the equation C × X = λ × X.
Attention: here X does not represent the dataset X (slide 41); it is just the name of the unknown eigenvector (during the course session, we named this vector V).

Example: PCA (5/11)
o Solving the matrices, we get:
0.616556x + 0.615444y = 1.28403x, hence x = 0.922049·y
(x and y are the components of the eigenvector X), so if we put y = 1, x comes out to be 0.922049.
The updated eigenvector X will then look like:

Example: PCA (6/11)

Example: PCA (7/11)
• Secondly, we find the eigenvector for the eigenvalue 0.0490834 using the same equation (same approach as the previous step).

The eigenvector found for the eigenvalue λ₂ has components 0.735176 and 0.677873.

Sum of eigenvalues: λ₁ + λ₂ = 1.33 = total variance
(the majority of the variance comes from λ₁).

Example: PCA (8/11)
• Step 3: Arrange the eigenvalues
• The eigenvector with the highest eigenvalue (λ₁) is the first principal component of the dataset.

• So in this case:
o The eigenvector of λ₁ is the first principal component.

o The eigenvector of λ₂ is the second principal component.

Example: PCA (9/11)
• Step 4: Form the feature vector

The feature vector is built by stacking the retained eigenvectors as columns:
• the first column is the eigenvector of λ₁ and the second column is the eigenvector of λ₂.

Example: PCA (10/11)
• Step 5: Transform the original dataset
• Use the equation Z = X × V, where X is the standardized dataset, V is the feature vector (eigenvector matrix), and Z is the transformed dataset.

Example: PCA (11/11)

• Organizing information into principal components in this way allows you to reduce dimensionality without losing much information, by discarding the components with little information and considering the remaining components as your new variables.

Best Number of Principal Components (1/2)
• Method 1: If your sole purpose for PCA is data visualization, select 2 or 3 principal components.
• Method 2: Plot the explained variance percentage of individual components and the percentage
of total variance captured by all principal components.

Best Number of Principal Components (2/2)
• Method 3: Create the scree plot.

• Method 4: Follow Kaiser’s rule: it is recommended to keep all the components with eigenvalues greater than 1.
• Method 5: Use a performance evaluation metric such as RMSE (for Regression) or Accuracy Score
(for Classification).
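Methods 2 and 4 above can be sketched with sklearn's fitted PCA attributes; the standardized Iris dataset is an illustrative choice:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
pca = PCA().fit(StandardScaler().fit_transform(X))  # keep all components

# Method 2: variance explained per component and the cumulative total
ratios = pca.explained_variance_ratio_
print(np.round(ratios, 3))
print(np.round(np.cumsum(ratios), 3))

# Method 4 (Kaiser's rule): keep components with eigenvalue > 1
k = int(np.sum(pca.explained_variance_ > 1))
print("components kept by Kaiser's rule:", k)
```

Plotting `ratios` against the component index gives the scree plot of Method 3; the "elbow" in that curve suggests where to truncate.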

Interpretation: correlation circle

Understanding the variables — PCA results and their interpretation:
• Absolute variable position → quality of the representation in the plane.
• Cosine of the angle between variables → correlation.

● The closer the variables are to the edge of the circle, the better they are represented by the factorial plane, i.e., the variable is well correlated with the two factors constituting this plane.
● A point is said to be well represented on an axis or a factorial plane if it is close to its projection on the axis or plane. If it is far away, it is said to be misrepresented.
● → consider the angle formed between the point and its projection on the axis (the closer it is to 90 degrees, the worse the point is represented).

Correlation circle (figure)
Interpretation: Individual factor map

Understanding the relationships between individuals — PCA results and their interpretation:
• Proximity in space between individuals → similarity.
• Contribution of an individual to the α axis → contribution to the inertia along the α axis.

The proximity in space between two well-represented individuals reflects the resemblance of these two individuals from the point of view of the values taken by the variables (and, conversely, their difference).

Individual factor map (figure)

